QUANTITATIVE PSYCHOLOGY AS A RATIONAL SCIENCE 


EDWARD E. CURETON* 
RICHARDSON, BELLOWS, HENRY & CO., INC. 


“The primary purpose of the Psychometric Society is to promote 
the development of psychology as a quantitative rational science. This 
concept of quantification involves the formulation of hypotheses in 
mathematical form, their development into a consistent quantitative 
psychological theory, and quantitative tests of the agreement between 
theory and experimental data.” 

Most of you will recognize in this quotation the official statement 
of the object of our Society, as given in Article I of its Constitution. 
I should like to call your attention to one feature of this statement. Aft- 
er the first sentence, an attempt is made to clarify the concept of psy- 
chology as a quantitative science. The term “rational” is not men- 
tioned again. 

The Psychometric Society was founded eleven years ago today. 
A great deal has been accomplished during these years in developing 
and applying quantitative methods. On the other hand, at least in the 
areas where the chief working tool is the psychological test, very little 
has been done toward the development of a rational science. In con- 
sequence, many otherwise excellent mathematical studies have taken 
their starts from assumptions which do not correspond with the ac- 
tualities of test structures and experimental controls. It is time to re- 
verse this trend, and to emphasize and develop the rational founda- 
tions of mental measurement. 

A psychological test consists simply of a set of verbal or other 
symbols printed in a booklet, or of some sort of apparatus to be manip- 
ulated. A test performance consists of certain aspects of the set of re- 
actions of an individual to such a set of symbols or apparatus, under 
more or less standardized conditions. The basic operation in all testing 
is the response of an individual to an item. If the item is constructed 
properly, and if the testing conditions are properly controlled, it is 
possible to label one or more particular responses to the item as “right” 
and all other responses as “wrong.” We can then assign a numerical 


* Address of the retiring President of the Psychometric Society, delivered at 
Philadelphia, Pennsylvania, September 4, 1946. 
index, say 1, to each “right” response, and another numerical index, 
say 0, to each “wrong” response. Having done this we have quantified 
the record of each response. Such a record then becomes, in the crud- 
est sense, a measurement. 

The ability to produce correct responses to some items may be 
considered valuable in and of itself. Examples of such items are the 
meanings of common words, the fundamental arithmetical combina- 
tions, and the spellings of the words most frequently used in written 
communication. In such cases each separate item may be considered 
a test. More often, even in cases such as the ones mentioned in these 
examples, the responses to a set of test items are taken as indices of 
some wider ability, such as vocabulary, arithmetical proficiency, or 
spelling ability. The ability itself is defined in terms of a delimited 
universe of logically similar items. The test consists of a random 
sample — or more commonly a stratified sample — of items from this 
universe. A test of this type is commonly termed face-valid. 

In general, however, a test performance is taken as a partial in- 
dex of an ability or trait which is assumed to be more general than the 
particular universe sampled by the test items. It is believed that the 
reaction-capacities of individuals form systems. These systems may be 
based on structural mechanisms which differentiate such functions as 
memory, reasoning, and perception. They may also be based on the 
fundamental sets of symbols, such as language and number, or on the 
facts and principles taught in a particular school subject, or on the in- 
formation and skills developed in a specific occupation. 

The object of any psychological test performance is to predict the 
average quality of some criterion performance. A criterion perform- 
ance consists of those elements of the behavior of an individual, under 
some specified set of environmental conditions, which are considered 
pertinent to a defined scale of values. When the average quality of 
performance has been evaluated, and the record of the evaluation has 
been quantified, we term this record a criterion score. 

In order to predict criterion scores, we establish conditions of a 
particular type which we call test conditions. The examinations or 
tests themselves are the major but not the only elements of these test 
conditions. Test conditions differ in one important respect from cri- 
terion conditions. They are designed to permit the convenient assign- 
ment of numerical indices to certain elements of the responses — 
namely, those having to do with the appropriateness or correctness of 
the reactions to the problems stated by the test items. 

A test item is valid, with respect to a given criterion, in the de- 
gree to which the average criterion score of those who pass it exceeds 
the average criterion score of those who fail it. This of course is an old 
story. The simplest case, however, namely that of a face-valid test, 
has often been misunderstood. In such tests, the criterion perform- 
ance to be predicted is the proportion of correct responses which the 
examinee would make to all the items in a universe from which the 
items of the test itself are a random or stratified sample. The crite- 
rion score will be simply the count of correct responses to the items of 
the test. These items undoubtedly draw upon a multidimensional uni- 
verse of reactions. It is only the value element in the criterion which 
may be considered to be a linear continuum — the judgment, that is, 
that those who react successfully to a larger number of items are 
somehow superior to those who react successfully to a smaller number. 
Individual items may vary considerably in quality as predictors of the 
results of such counts, and their qualities may vary still further when 
they are considered as elements of groups — i.e., tests — rather than 
separately. This condition obviously becomes still more complex when 
the criterion performance is entirely separate from the test perform- 
ance. 
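
The definition of item validity that opens this paragraph can be sketched in a few lines. This is only an illustration: the function name `item_validity` and the six-examinee data are hypothetical, and the index computed is simply the difference between the two mean criterion scores.

```python
from statistics import mean


def item_validity(item_scores, criterion_scores):
    """Mean criterion score of those passing the item minus that of those failing it."""
    passed = [c for x, c in zip(item_scores, criterion_scores) if x == 1]
    failed = [c for x, c in zip(item_scores, criterion_scores) if x == 0]
    if not passed or not failed:
        raise ValueError("need at least one examinee passing and one failing the item")
    return mean(passed) - mean(failed)


# Hypothetical data: six examinees, one item (1 = right, 0 = wrong), one criterion score each.
item = [1, 1, 0, 1, 0, 0]
criterion = [14, 11, 9, 12, 10, 7]
print(item_validity(item, criterion))
```

A positive value indicates that those who pass the item have the higher average criterion score; a value near zero indicates an item of little validity for that criterion.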

A test score is not a linear measurement at all. Ideally, the items 
should be treated as independent variables in a multiple regression 
equation. In practice, unit weights often provide a good enough ap- 
proximation to such regression weights. The point to be noted, how- 
ever, is that a test score is always a weighted-composite predictor, 
even when the weights are all unity, and when the criterion score it- 
self is a count of the number of correct responses to the test items. 

Every once in a while it is suggested that a psychological test 
can rank a group of examinees accurately, even if it cannot measure 
them with equal units. This is a fallacy. At best there is only a good 
probability that an individual who has a higher test score will also 
have a higher criterion score than will another individual who has a 
lower test score. It cannot be too strongly emphasized that this is true 
even when the criterion is the so-called “true” score on the test itself. 
This is in fact the root problem of reliability. 

In practice, as all of you know, we quite often can and do treat 
test scores as though they were approximately linear measurements, 
rather than merely composite predictors. We do this so frequently, in 
fact, that we often fail to note that while the procedure is fairly 
logical and useful with some kinds of tests, it is quite illogical and even 
pernicious with some others. In discussing this problem it may be use- 
ful to begin by considering a somewhat more obvious case. 

Suppose that for each individual in a large group of school chil- 
dren we add height in inches, weight in pounds, and age in months. It 
would be comparatively simple to construct a gadget that would pro- 
vide the readings upon a single dial. Any school child could give the 
proper name to this variable. He would call it “size”. Now the first 
question is this: Would “size,” so defined, be a totally useless variable 
for research and service purposes? 

Quite the contrary. It would in fact be an excellent variable for 
many purposes. Measures of “size” would have high reliability. They 
would correlate fairly highly with grade placement in school, and with 
the scores of school children on tests of mental ability and of school 
achievement. Physical education teachers would find it a useful index 
in setting up quasi-homogeneous groups to play athletic games. It 
would probably yield a better measure of general physiological matur- 
ity than would, say, an index of the ossification of the carpal bones. 
It would provide a fairly valid index of social maturity, and unless the 
social psychologists provide us with a better one, “progressive edu- 
cators” might soon be advocating that school children be promoted on 
the basis of “size” rather than on the basis of achievement. 

On the other hand, research workers would not feel quite happy 
in using “size” as a variable in multiple regression equations, along 
with aptitude test scores and other measures, for predicting academic 
success or job proficiency. It would probably contribute substantially 
to such predictions, but the research workers would deplore the loss 
of information involved in substituting unit weights for regression 
weights in combining age, height, and weight. Factor analysts work- 
ing with physical measures and measures of strength, agility, and the 
like, would reject it at once on account of its obvious factorial com- 
plexity. 

The only trouble, in fact, with a measure such as “size,” is that 
there is no way to define what it measures, apart from the arbitrary 
rule of combination of the three basic variables of which it is com- 
posed. Let us look at these basic variables a little more closely. Age is 
measured in the standard units of the time scale, starting from an 
arbitrary zero-point at birth. This zero-point is comparatively close 
to the more logical “true” zero-point at conception. Height is meas- 
ured in the standard units of length, by placing the measuring stick 
in a vertical position with the floor as the zero-point. Length is linear 
by definition; whether time is simultaneously linear we shall be con- 
tent to leave to the relativity-theorists. But the causes of the height of 
an individual are certainly quite numerous and complex. Weight is 
still more complex; with respect to height it is fairly clearly a cubical 
measure, but with respect to gravity it is equally clearly a linear 
measure. It can be defined operationally also, in terms of adding equal 
unit weights to a balance. 

There is no particular difficulty, then, in using as a single pre- 
dictor a linear composite of any set of items or tests. The intercor- 
relations among them may have any values whatever. The only loss is 
in the substitution of unit or arbitrary weights for the regression 
weights. To form a scale, however, a set of test performances must ex- 
hibit some property or properties on the basis of which we can super- 
impose a linear continuum upon the complex of responses to the items. 
We will not, in general, be able to apply external scales, operationally 
defined, as in the cases of height and weight. There are of course a 
few exceptions, such as work-limit performance tests which are evalu- 
ated entirely in terms of time, and criterion scores based on counts of 
accidents or of units of output or spoilage. 

Criterion performances must be scaled in some manner before 
they can be predicted. Regardless of the complexity of the underlying 
behavior-patterns, we may judge the quality of the results in terms of 
a single value scale. The simplest of such scales results when we mere- 
ly select one group of individuals who are judged to be superior in cri- 
terion performance and another group who are judged to be inferior. 
You are all familiar with more complex methods, which involve the 
use of rating scales, ranking schemes, objective records of perform- 
ance, and various more or less arbitrary combinations of such indices. 
If the correlations between such criterion scores and the scores on pre- 
dictor variables turn out to be linear, both the criterion scores and 
the predictor scores may be considered to possess the attribute of 
linearity to a sufficient degree for the purpose at hand. 

The most important requirement for a test whose scores are to be 
interpreted as measurements would seem to be that its items all draw 
upon the same set of abilities and traits. This implies that the inter- 
item correlations should form an essentially hierarchical system. In 
testing the hypothesis of hierarchy, we should of course use tetracho- 
ric correlation coefficients to avoid the introduction of irrelevant dif- 
ficulty factors. It does not appear to be necessary that there be only 
one general factor. If this is necessary, in fact, then factor analysis of 
test scores would appear to be logically impossible. But there should 
not be any large group factors present in sub-sets of the items of the 
test. To say the same thing in another way, every common factor in 
the test should be present in every item of the test. The more items we 
have in such a test, the more important the general factor or factors 
become, relative to the specific factors, as determiners of the total 
scores. Many tests of this type can be built quite readily. 

A less important requirement, but one which should not be neg- 
lected, concerns the distributions of the item difficulties and the item- 
test correlations. If the test is to be used to predict a single criterion 
at a particular critical score level, the items should all be of approxi- 
mately equal difficulty. The per cent passing each item, in a group of 
individuals having criterion scores close to the critical score, should be 
half-way between the per cent who would pass the item by chance and 
100 per cent. 
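
The halfway rule just stated is simple arithmetic. The sketch below assumes, as an illustration only, that the chance level for an item with k equally attractive choices is 100/k per cent; the function name `target_percent_passing` is hypothetical.

```python
def target_percent_passing(n_choices):
    """Per cent passing halfway between the chance level and 100 per cent."""
    chance = 100.0 / n_choices  # assumed chance level for blind guessing among n_choices
    return (chance + 100.0) / 2.0


for k in (2, 4, 5):
    print(k, target_percent_passing(k))
```

Under this assumption a two-choice item would aim at 75 per cent passing, a four-choice item at 62.5 per cent, and a five-choice item at 60 per cent.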

If the test is to be used to differentiate individuals throughout 
a fairly wide range of abilities, the item difficulties should be distribu- 
ted rectangularly. This can be accomplished with sufficient accuracy 
by converting the per cent passing each item to a standard score by 
assuming normality, and arranging the items at roughly equal inter- 
vals along the standard score scale. It is not important that successive 
intervals be equal, but it is important that there be no severe thinning 
out or piling up of items in any one region of the scale — particularly 
near the ends or at the middle. This procedure for item selection 
tends to equalize the standard errors of the scores throughout their 
range, at the same time providing approximately equal score units. 
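
The conversion described in this paragraph can be sketched as follows. The sign convention (a harder item receives a higher standard score), the function names, and the seven per cent passing values are assumptions made for illustration; the normal deviate is taken from the Python standard library rather than from a printed table.

```python
from statistics import NormalDist


def difficulty_z(percent_passing):
    """Standard-score difficulty: the normal deviate above which `percent_passing` per cent fall."""
    return -NormalDist().inv_cdf(percent_passing / 100.0)


def largest_gap(percents):
    """Largest interval between adjacent items on the standard score scale."""
    z = sorted(difficulty_z(p) for p in percents)
    return max(b - a for a, b in zip(z, z[1:]))


# Hypothetical per cent passing values for seven items.
percents = [93.3, 84.1, 69.1, 50.0, 30.9, 15.9, 6.7]
for p in percents:
    print(p, round(difficulty_z(p), 2))
print("largest gap:", round(largest_gap(percents), 2))
```

A large value of `largest_gap`, relative to the other intervals, would flag the thinning out of items that the paragraph warns against.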

The average item-test correlation should exhibit no systematic 
variation with difficulty. We should therefore use biserial correlation 
coefficients for this purpose. Whether items should be retained which 
are so easy or so difficult that they cannot be as valid as those nearer 
the center of the scale is a matter to be determined by the use to which 
the test is to be put. 

Let us consider, finally, the difference between the aims and me- 
thods of applied mathematics and of quantitative science. Mathemat- 
ics commences with a set of postulates and proceeds to deduce their 
logical consequences. A properly stated postulate implies no con- 
ditions other than those which are explicit in its own statement and in 
the definitions of the terms used. Quantitative science, on the other 
hand, consists in devising experiments whose controls correspond one- 
for-one to the postulates of some mathematical theory and of inter- 
preting the results in terms of the logic of this theory. If this is im- 
possible, the scientist must provide the mathematician with a new set 
of postulates which do correspond one-for-one with his proposed ex- 
perimental controls. The crucial scientific problem is precisely the 
one which the mathematician as such must necessarily ignore — name- 
ly, the problem of whether or not his experimental design contains 
implications which are not present, or lacks implications which are 
present, in the postulates of the mathematics which he proposes to use 
in interpreting his findings. 

In many of the papers which have appeared in Psychometrika, 
there has been no definite attempt to relate the postulates employed 
to the experimental controls which they imply. Psychometric science, 
so far, is not in danger of becoming too quantitative. But it does ap- 
pear to be in danger of becoming too much a branch of applied mathe- 
matics, and too little a branch of rational quantitative science. 
VARIATION IN TEST VALIDITY WITH VARIATION IN THE 
DISTRIBUTION OF ITEM DIFFICULTIES, NUMBER 
OF ITEMS, AND DEGREE OF THEIR 
INTERCORRELATION 


HUBERT E. BROGDEN 


PERSONNEL RESEARCH SECTION 
THE ADJUTANT GENERAL’S OFFICE 


The relation between item difficulty distributions and the “va- 
lidity” and reliability of tests is computed through use of normal 
correlation surfaces for varying numbers of items and varying de- 
grees of item intercorrelations. Optimal or near optimal item diffi- 
culty distributions are thus identified for various possible item diffi- 
culty distributions. The results indicate that, if a test is of conven- 
tional length, is homogeneous as to content, and has a symmetrical 
distribution of item difficulties, correlation with a normally dis- 
tributed perfect measure of the attribute common to the items does 
not vary appreciably with variation in the item difficulty distribu- 
tion. Greater variation was evident in correlation with a second 
duplicate test (reliability). The general implications of these find- 
ings and their particular significance for evaluating techniques 
aimed at increasing reliability are considered. 


This paper will be concerned with determining the distribution 
of item difficulties which will maximize the correlation of the test 
with a perfect measure of the characteristic the test is intended to 
measure; and with the magnitude of the changes in this correlation 
with variation in the item difficulty distributions. It is desired to 
determine this distribution for tests with varying numbers of items 
and with varying degrees of item intercorrelation and, incidentally, 
to evaluate the effect of these latter variables and their interactions 
with item difficulty distribution. 

The problem of the maximal item difficulty distribution is com- 
plex from the theoretical viewpoint. Although it can be seen that 
with perfectly valid items, their difficulty values should, like points 
on a yardstick, be equally spaced,* as the items involve more and 
more error and thus become less and less valid, it is probable that 
the optimal distribution involves closer grouping of the difficulty 
values around the fifty per cent value. The latter value is optimal 
for a single item (7) and for a group of items which all correlate 


* When expressed in terms of standard score scale values—not percentage 
correct. 
with the criterion but which do not intercorrelate. The question as 
to just how closely the items should be grouped around the fifty per 
cent value or just “how much difference it makes” has no immedi- 
ately obvious answer for cases intermediate to the two extremes 
just mentioned. The present study is designed to provide answers 
to these questions. 

A number of writers have concerned themselves with similar 
problems. Richardson (5) has considered the effect of item difficulty 
distribution on tests designed to cut at a particular point, and con- 
cluded that optimal results are obtained if items have percentage 
passing values equal to the percentage above the point of cut on the 
test as a whole. Thurstone (7) has indicated the desirability of items 
which approach .5 difficulty values. Gulliksen (2) has undertaken 
proof of several propositions concerning test reliability (which will 
be considered in more detail later). Tucker (8) has determined the 
validity of tests homogeneous with respect to item difficulty for vari- 
ous numbers of items and various assumed item intercorrelation val- 
ues. The method and results overlap somewhat with those of the 
present article and will likewise be discussed at appropriate points. 


1. Derivation of Formulas 

The problem will be approached from a semi-empirical basis. 
Although the intercorrelations of the variables involved are too com- 
plicated a function of the factors in which we are interested to allow 
a simple solution to the problem of selecting optimal distributions of 
the difficulty values for various numbers of items and for various 
degrees of item intercorrelation, it is possible to employ tables to de- 
termine individual correlations, and to substitute the sum of the 
items and the continuous variable directly in the product-moment 
formula to determine the “validity” and reliability of the sum. The 
formulas will be derived in such form that the results can be com- 
puted with minimum effort from correlation surface tables. 

The correlation of a sum with a continuous variable is, by direct 
substitution in the product-moment formula, 


$$
r_{t \Sigma X} = \frac{\sum t\,(X_1 + X_2 + \cdots + X_n)/N - M_t M_{\Sigma X}}{\sigma_t\, \sigma_{\Sigma X}} , \tag{1}
$$


where t is the continuous variable (true score), the X’s are summed 
variables, N is the number of cases, and n is the number of items. 
If we assume that the continuous variable is expressed in terms of 
standard scores, reduce the numerator, and expand the squared sum- 
mation in the denominator, we obtain 


$$
r_{t \Sigma X} = \frac{\sum tX_1/N + \sum tX_2/N + \cdots + \sum tX_n/N}{\sqrt{\sum X_1^2/N + \sum X_1X_2/N + \cdots + \sum X_n^2/N - M_{\Sigma X}^2}} . \tag{2}
$$


If $X_1, X_2, \ldots, X_n$ are considered to be two-category variables with 
values of 1 and 0 (as would be the case with test items), the cross- 
products in the numerator of (2) would each become the summation 
of the t values for those scoring correctly on an item or, in other 
words, those obtaining a credit of one, since cross-products for those 
obtaining a credit of zero would obviously vanish. If such summa- 
tions are multiplied by $N_{ci}/N_{ci}$, where the subscript ci refers to the 
population of those scoring correctly (and receiving a credit of one), 
and $P_i$ is substituted for $N_{ci}/N$ or the proportion of individuals 
scoring one on the item in question, the $\sum tX_i/N$ expressions of the 
numerator of the right-hand side of (2) become $(\sum tX_i/N_{ci}) \cdot (N_{ci}/N)$ 
or $M_{t_{ci}} P_i$.* The cross-products of the items in the denominator are 


* Note that this is the numerator of a product-moment or point biserial 
correlation. The usual formula for the point biserial coefficient may be obtained 
by dividing by the s.d. of t and the s.d. of the item, the latter being equal to 
$\sqrt{PQ}$. It should be remembered that as the formula is usually written, 

$$
r_{pbis} = \frac{M_p - M_t}{\sigma_t} \sqrt{\frac{P}{Q}} ,
$$

the mean of the total group in the numerator must also be included. This 
mean value is zero in the present instance since we are dealing with standard 
scores. Similarly, we may obtain the formula for the phi-coefficient by rearrang- 
ing the elements of the numerator of (3) and dividing by the s.d.’s or $\sqrt{PQ}$ to 
give $(P_{ij} - P_i P_j) / (\sqrt{P_i Q_i}\, \sqrt{P_j Q_j})$. It should possibly be emphasized here that any 
analysis of a set of items aimed at determining the correlation of various possible 
selections or weighted selections—as in a multiple—must employ product-moment 
intercorrelations and validities rather than tetrachoric or biserial coefficients. 
This has in fact just been demonstrated in that the actual summing process em- 
ployed in scoring a test was substituted directly in the product-moment formula 
and the resulting correlation of sums formulas was shown to involve product- 
moment entries. 

However, a marked disadvantage in using the phi-coefficient or the point 
biserial in item selection is still evident when the intercorrelations are not known; 
namely, the fact that they are a function of the difficulty of the item. It would 
seem that implications for item analysis might be suggested somewhat as 
follows: 

1. When the intercorrelations are known, product-moment item-continuous 
variable coefficients are desirable [assuming of course that they are to be em- 
ployed either in a complete multiple solution or in some procedure involving cor- 
relation of sums, such as that proposed by Wherry and Gaylord (9)]. If the 
intercorrelations influence the selection in this way the bias in the point-biserial 
item validities does not reduce the accuracy of item selection since the lower 
intercorrelations of the product-moment intercorrelations of the items will com- 
pensate for this bias so as to produce optimal results. Reference to item diffi- 
equal to unity when an individual scores one on both items, but van- 
ish in all other instances. Hence, the value of the summation divided 
by N will be equal to the proportion of cases scoring one on both 
variables. This will be indicated by the symbol $P_{ij}$. The diagonals 
will be equal to the proportion right on that one variable (that is, 
$P_{ii} = P_i$). $M_{\Sigma X}$ will equal $\sum_{1}^{n} P_i$. Substituting in Equation (2), 
we obtain 

$$
r_{t S_x} = \frac{M_{t_{c1}} P_1 + M_{t_{c2}} P_2 + \cdots + M_{t_{cn}} P_n}
{\left[\, P_{11} + P_{12} + \cdots + P_{1n} + P_{21} + P_{22} + \cdots + P_{2n} + \cdots + P_{n1} + P_{n2} + \cdots + P_{nn} - \left( \sum_{1}^{n} P_i \right)^{2} \right]^{1/2}} \tag{3}
$$

or 

$$
r_{t S_x} = \frac{\sum_{i} \left( M_{t_{ci}} P_i \right)}
{\left[\, \sum_{i} \sum_{j} P_{ij} - \left( \sum_{i} P_i \right)^{2} \right]^{1/2}} . \tag{4}
$$

Since Equations (3) and (4) refer to the correlation of sums of 
items with t, $S_x$ has been substituted for $\sum X$ in the subscript of r. 
Equation (4) is quite general and can be employed to determine the 
correlation of the sums of any set of two-category items with any 
continuous variable. It can be seen in Equation (4) that there are 
infinite varieties of possible assumptions as to the item intercorre- 
lations, and correlation between the continuous variable and the 
items. In the present article we are concerned with the effect of 
variation of the distribution of item difficulties on the validities of 
the test. Since as a general rule tests consist of items intended to 
measure at the point of cut the same general function, we will assume 
for the purposes of this article that component items in all tests or 
aggregates of items satisfy this condition, and further, that continu- 
ous variable t is a perfect measure of the function common to the 
item aggregates. It will be convenient to assume also that all items 
measure the common function with the same degree of accuracy. 
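
Equation (4) can be checked numerically against the ordinary product-moment formula. The sketch below is an illustration under stated assumptions: the eight-case data set and the function names are hypothetical, t is standardized with the population standard deviation (dividing by N), and Equation (4) is taken to be the sum of the products of each item's proportion passing and the mean t of those passing it, divided by the square root of the sum of all joint proportions (with each diagonal entry equal to that item's own proportion passing) less the squared sum of the proportions passing.

```python
from math import sqrt


def pearson(x, y):
    """Ordinary product-moment correlation (population formulas)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return sxy / (sx * sy)


def standardize(values):
    """Convert raw values to standard scores (mean 0, population s.d. 1)."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / s for v in values]


def r_sum_equation4(items, t):
    """Correlation of the sum of 0/1 items with standard-score t, in the form of Equation (4)."""
    n_cases = len(t)
    P = [sum(col) / n_cases for col in items]  # proportion scoring one on each item
    # Mean of t among those scoring one on each item (each item needs at least one pass).
    M = [sum(tv for tv, x in zip(t, col) if x == 1) / sum(col) for col in items]
    numerator = sum(m * p for m, p in zip(M, P))
    # Joint proportions P_ij; for i == j this is just P_i because x * x == x for 0/1 scores.
    sum_Pij = sum(
        sum(a * b for a, b in zip(col_i, col_j)) / n_cases
        for col_i in items
        for col_j in items
    )
    return numerator / sqrt(sum_Pij - sum(P) ** 2)


# Hypothetical data: eight examinees, three items.
raw_t = [-1.2, -0.5, 0.1, 0.4, 0.9, 1.6, -0.8, 0.3]
items = [
    [0, 0, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
]
t = standardize(raw_t)
total = [sum(scores) for scores in zip(*items)]  # unit-weighted test score per examinee

print(r_sum_equation4(items, t))
print(pearson(total, t))
```

The two printed values should agree to within rounding error, since the second form is only an algebraic rearrangement of the first for 0/1 items.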





culties would be unnecessary. In obtaining the optimal selection of items the 
optimal distribution is automatically obtained. 

2. If the product-moment intercorrelations are not available, biserial va- 
lidities are possibly preferable in that the difficulty bias so evident in point 
biserial validities tends to be avoided. The item difficulties can then be considered 
in obtaining the proper degree of discrimination at various difficulty levels. 
The assumption that all items measure at the point of dichoto- 
mization the same function to the same degree does not mean, how- 
ever, that the product-moment intercorrelations of the items will be 
equal or that they can be accounted for by a single factor [see (1) 
and (8) on this point], since such coefficients are a function of the 
difficulty of the items. It is, in fact, only because of this that the 
present investigation has been undertaken. This assumption that the 
items of an aggregate measure a single common characteristic to the 
same degree becomes meaningful and reasonable if it is stated in 
terms of tetrachoric and biserial correlation coefficients; that is, if 
it is assumed that the tetrachoric intercorrelations of the two cate- 
gory items are equal and that the biserial correlations of the items 
with a continuous normally distributed and “perfect” measure of the 
common characteristic (which we will refer to as true score) are the 
same for all items. In stating our assumption in this manner, it is 
evident that true score is assumed to be normally distributed, and 
further, that if the two category items were continuous variables, 
all correlation surfaces would be normal. Consequently, tables can 
be employed for determining the $P_{ij}$ values in Equation (4) and, 
through them the product-moment intercorrelations and the true 
score correlations can be computed. 

The assumption of normality of all correlation surfaces is not 
only necessary from the viewpoint of computation labor, but is prob- 
ably also most desirable theoretically, in that such distributions and 
surfaces most closely approximate the various possible surfaces that 
might be obtained in actual practice. The implications from the fol- 
lowing computations are probably more general than those that could 
be obtained from empirical data. It is improbable that item and total 
score distributions and correlation surfaces obtained from any one 
empirical example would be as representative of those to be obtained 
from the general population of empirical examples as are the as- 
sumed normal surfaces. In addition, unless a very large number of 
cases is employed, results computed from empirical data are apt to 
be seriously distorted by sampling error. 

Given our assumptions of a normally distributed perfect meas- 
ure of the characteristic common to the items, and equal tetrachoric 
intercorrelations of items, the proportion answering any pair of 
items correctly, or the $P_{ij}$ values of the denominator of Equation (4), 
may be determined by referring to Pearson’s Tables (4) of normal 
correlation surfaces. From the assumptions concerning the nature of 
the continuous variable, it follows that the biserial correlation of 
each item with true score is the square root of their tetrachoric in- 
tercorrelations (since the tetrachoric intercorrelation is analogous to 
a reliability coefficient and the biserial correlation with true score 
is analogous to the highest possible validity coefficient). That is, 
if both true score and the items were continuous variables, the corre- 
lation of an item with true score would be the square root of the 
correlation between two similar items. Biserial and tetrachoric cor- 
relations are, of course, equivalent to the product-moment correla- 
tions between assumed normally distributed variables. 
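
The square-root relation stated above can be made explicit with a short derivation. This is a sketch under an added assumption that the text does not spell out: the continuous variables underlying the items, written here as $x_i$, follow a one-factor model with a common loading $a$ on true score $t$ and mutually uncorrelated unique parts $e_i$. The symbols $x_i$, $a$, and $e_i$ are introduced for illustration only.

```latex
\begin{align*}
x_i &= a\,t + \sqrt{1 - a^{2}}\; e_i ,
  \qquad \operatorname{var}(t) = \operatorname{var}(e_i) = 1 ,
  \qquad \operatorname{cov}(t, e_i) = \operatorname{cov}(e_i, e_j) = 0 \quad (i \neq j) , \\
r_{x_i t} &= \operatorname{cov}(x_i, t) = a , \\
r_{x_i x_j} &= \operatorname{cov}(x_i, x_j) = a^{2} \quad (i \neq j) , \\
r_{bis} &= r_{x_i t} = \sqrt{r_{x_i x_j}} = \sqrt{r_{tet}} .
\end{align*}
```

Under this model a tetrachoric intercorrelation of .25 between items, for example, corresponds to a biserial correlation of .50 between each item and true score.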

An expression for computing the $M_{t_{ci}}$ values of Equation (3) 
from assumed biserial correlations with true score and item difficulty 
values is required. If we express true score in terms of standard 
scores, the usual biserial correlation formula 

$$
r_{bis} = \frac{M_p - M_t}{\sigma_t} \cdot \frac{P}{z} \tag{5}
$$

reduces to 

$$
r_{bis} = \frac{M_p P}{z} \tag{6}
$$

and 

$$
M_p = \frac{r_{bis}\, z}{P} . \tag{7}
$$


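$M_{t_{ci}}$ of Equation (4) is analogous to $M_p$ of Equation (7). Substitut- 
ing from (7) into (4) we obtain, after cancelling the P values and 
factoring out the constant $r_{bis}$, 

$$
r_{t S_x} = \frac{r_{bis} \sum_{i} z_i}
{\left[\, \sum_{i} \sum_{j} P_{ij} - \left( \sum_{i} P_i \right)^{2} \right]^{1/2}} . \tag{8}
$$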


* Tucker (8), in obtaining a solution to the problem of the present article 
for the special case of items equivalent in difficulty, arrived at equations from 
which the test score validities could be directly computed. His initial assump- 
tions were more fundamental mathematically. He assumed that (1) the relation 
of the probability of success on an item to true score on the ability is defined by 
the normal probability integral, and (2) ability, or true score, is normally dis- 
tributed. While Tucker’s derivations lend considerable support to statements in 
the present article which might otherwise be regarded as direct assertions, the 
author feels that the implications of Tucker’s assumptions are less readily under- 
stood by the average psychologist than are the assumptions of the present article 
and are consequently a less convenient base for further derivation. Further, 
they led, in Tucker’s case, to an expression of the results in terms which require 
some re-interpretation. That is, to relate the increase in test validity to increase 
in degree of tetrachoric intercorrelation of the items is more directly meaningful 
than to relate it to increase in the degree of product-moment intercorrelation of 
the items, or to decrease in the spread of the item curve. In interpreting product- 
moment item intercorrelation, allowance must be made for the effect of difficulty. 
Spread of item curve (another constant employed by Tucker in presenting his 
results) must also be re-interpreted before it has direct meaning in factor or 
test construction theory. Since Tucker did not concern himself with tests having 
This is the formula actually employed in the following computation. 
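Equation (7) can be evaluated numerically once an item's proportion passing and its assumed biserial correlation with true score are fixed. The following is a minimal sketch only: the function name is hypothetical, and Python's `statistics.NormalDist` is assumed here as a stand-in for printed tables of the normal curve.

```python
from statistics import NormalDist


def mean_true_score_of_passers(p, r_bis):
    """Equation (7): M_p = r_bis * z / p, with true score in standard units.

    p     -- proportion of examinees passing the item (0 < p < 1)
    r_bis -- assumed biserial correlation of the item with true score
    """
    std_normal = NormalDist()            # mean 0, s.d. 1
    cut = std_normal.inv_cdf(1.0 - p)    # point of dichotomy: proportion p lies above it
    z = std_normal.pdf(cut)              # ordinate of the normal curve at the cut
    return r_bis * z / p


# An item passed by half the group, with biserial sqrt(.4) against true score.
print(round(mean_true_score_of_passers(0.5, 0.4 ** 0.5), 3))
```

With p = .5 and a biserial of unity the function returns the ordinate at the median divided by .5, or about .798.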


2. Computed Validities of Tests With Various Assumed 
Item Difficulty Distributions 
For the purposes of the present investigation, all assumed item 
difficulty distributions are stated in terms of one-half sigma units 
ranging from -2.0 to 2.0. These correspond to the “percentage 
passing” values of 2.27, 6.68, 15.87, 30.85, 50.00, 69.15, 84.13, 93.32, and 
97.73. 
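The correspondence between half-sigma baseline values and percentages can be checked from the normal integral. The sketch below assumes Python's `statistics.NormalDist` in place of printed tables; the names are illustrative only, and the last decimal may differ slightly from the figures quoted above because of rounding.

```python
from statistics import NormalDist

# Difficulty values in half-sigma steps, as used for the assumed distributions.
sigma_values = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def percentage_below(sigma):
    """Percentage of the unit normal curve lying below a baseline value."""
    return 100.0 * NormalDist().cdf(sigma)


percentages = [round(percentage_below(s), 2) for s in sigma_values]
print(percentages)
```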
Four types of item difficulty distributions will be involved in the 
following discussion, namely: (1) rectilinear (that is, equal numbers 
of items for all of the above difficulty values); (2) normal; (3) all 
items at .5; and (4) a skewed distribution. The last was included 
because it was desired to determine the extent of the decrement caused 
by the skew. With an exception noted below, $r_{(\Sigma i)t}$ was determined 
for each of these distribution “types” for N’s of 9, 18, 45, 90, and 
153, and for each of the four hypothetical continuous item intercorrelation 
values of .2, .4, .6, and .8. In the skewed distribution, .22 of 
all items had difficulty values of -2.0, .44 had difficulty values of 
-1.5, .22 had difficulty values of -1.0, and .11 had difficulty values of 
-.5. Since the “normal” distributions could only approximate normality, 
the exact distributions are listed in Table 1. The nine-item 


TABLE 1 
Item Difficulty Frequency Distributions Listed As Normal 

                        S.D. Values 
      -2.0  -1.5  -1.0   -.5     0    .5   1.0   1.5   2.0 
                         % Values 
        02    07    16    31    50    69    84    93    98      N 
1        1     1     2     3     4     3     2     1     1     18 
2        2     3     5     8     9     8     5     3     2     45 
3        4     6    10    16    18    16    10     6     4     90 
4        6    10    19    27    29    27    19    10     6    153 


normal distribution was excluded since the best approximation to a 
normal distribution possible with that number of items was consid- 
ered too rough to be meaningful. 

In Table 2 the $r_{(\Sigma i)t}$ values are listed. Table 2 indicates that, 
with item tetrachorics of .2 or .4, it is advantageous to group the 





items of varying difficulty, the need for stating his assumptions in terms of ability 
factors and of avoiding difficulty factors was not of particular importance. In 
the present article the necessity of avoiding confusion between these two types 
of factors led to stating the initial assumptions in terms of tetrachoric correla- 
tions and expressing the final results in terms of the assumed tetrachoric inter- 
correlations of the items rather than in terms of the phi-coefficient or the degree 
of spread of the item curve. 
TABLE 2 
Test Validities by Distribution Form, Number of Items, and 
Assumed Item Intercorrelations* 

Item Inter-                  Type of Distribution 
correlations     N    All .5   Rectilinear   Normal   Skew 

    .2           9     753        653          -†      600 
                18     849        771         807      715 
                45     928        883         905      828 
                90     961        932         945      879 
               153     974        954         963      903 

    .4           9     860        794          -†      708 
                18     917        876         897      783 
                45     957        939         950      842 
                90     971        965         969      864 
               153     977        976         977      874 

    .6           9     896        868          -†      741 
                18     930        928         933      786 
                45     950        962         964      817 
                90     958        976         975      828 
               153     961        982         979      832 

    .8           9     894        919          -†      738 
                18     911        953         953      765 
                45     921        975         969      780 
                90     926        983         975      785 
               153     927        986         977      787 

* Decimals normally preceding each entry have been omitted. 
† See text for reason for excluding 9-item normal distribution. 


items around percentage passing values of 50.0, although the advan- 
tage becomes inappreciable as the number of items in the sum is in- 
creased. With tetrachoric intercorrelations as high as .8, grouping 
of the items around percentage passing values of 50.0 is disadvan- 
tageous, even with a small number of items, and becomes more and 
more undesirable as the number of items is increased. While the 
variation in true score correlation with variation in item difficulty 
distributions is not large—except with the skewed distribution—it 
should be noted that variation in true score correlation with variation 
in assumed tetrachoric correlation, or with numbers of items, 
is also not too large—especially if we exclude from consideration item 
sums or “tests” containing fewer than 18 items. 

Only with rather considerable skew are the correlations between 
the test and true score heavily influenced by this characteristic of 
the item difficulty distributions. It would seem apparent, however, 
that at all levels of item intercorrelation and for all numbers of items, 
the disadvantage of the skewed difficulty distribution is appreciable. 
It is, of course, most evident with high intercorrelations, and since 
the skewed distribution chosen for these calculations is only one of 
a very large number of possibilities, the true score correlations for 
sums of items with various constant difficulty values were deter- 
mined and are presented in Table 3. Table 3 is intended to supple- 


TABLE 3 
$r_{(\Sigma i)t}$ Values for Items with Constant Difficulty Values by Difficulty 
Value, Number of Items, and Assumed Intercorrelation* 

                                  Difficulty 
  r     N   -2.0  -1.5  -1.0   -.5     0    .5   1.0   1.5   2.0 

 .2     9    425   562   667   731   753   731   667   562   425 
       18    535   674   780   830   849   830   780   674   535 
       45    663   785   866   913   928   913   866   785   663 
       90    732   836   906   947   961   947   906   836   732 
      153    767   860   924   962   974   962   924   860   767 

 .4     9    503   650   765   836   860   836   765   650   503 
       18    577   719   827   894   917   894   827   719   577 
       45    641   773   873   935   957   935   873   773   641 
       90    667   793   890   950   971   950   890   793   667 
      153    679   802   898   957   977   957   898   802   679 

 .6     9    504   658   786   868   896   868   786   658   504 
       18    543   696   821   902   930   902   821   696   543 
       45    571   722   845   923   950   923   845   722   571 
       90    582   731   853   931   958   931   853   731   582 
      153    586   735   856   934   961   934   856   735   586 

 .8     9    466   629   769   862   894   862   769   629   466 
       18    482   646   786   879   911   879   786   646   482 
       45    493   657   796   889   921   889   796   657   493 
       90    497   661   800   893   926   893   800   661   497 
      153    498   662   802   894   927   894   802   662   498 

* Decimals normally preceding each entry have been omitted. 


ment the last column of Table 2 by showing the effect of varying item 
difficulties when all items have the same difficulty values. Rough 
estimates of the validity of more markedly skewed distributions can 
be obtained in this manner. 

It is apparent from Table 3 that true score correlation is maximal 
at 50.0 difficulty, and that deviation of the difficulty value from 
50.0 is accompanied by considerable decrement in validity which increases 
as the number of items decreases, and as the intercorrelations 
of the items increase. It is least disadvantageous with 9 items and 
an assumed tetrachoric intercorrelation value of .2, while it is most 
disadvantageous with 153 items and an assumed tetrachoric intercorrelation 
value of .8. The magnitude of the decrement becomes increasingly 
large with each successive .5 drop in item difficulty. 

A rather surprising incidental result is the drop in validity observed 
in rows 2, 3, 4, and 5 of Table 2 and quite frequently in Table 
3, as the assumed tetrachoric intercorrelation value increases.* This 
curve could be extrapolated further by indicating the value of .798 
for $r_{(\Sigma i)t}$ when the assumed item tetrachoric intercorrelation and the 
assumed biserial correlation with true score are plus one. Substituting 
actual values in Equation (7), we obtain $nz/(n^2 \cdot .25)^{1/2}$. The n’s 
cancel, and the height of the ordinate at the median (.399) over .5 
gives the aforementioned value of .798, as the true score correlation 
for a test with all items at .5 difficulty and assumed item tetrachoric 
intercorrelations of plus one. 
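For the special case of all items at .5 difficulty, both the drop in validity and the limiting value of .798 can be checked without tables. The sketch below is an illustration under stated assumptions, not the procedure used for Table 2: it replaces Pearson's Tables with the known identity that two standard normal variables correlated $\rho$ both exceed their medians with probability $1/4 + \arcsin(\rho)/(2\pi)$, and it takes the item's covariance with true score from Equation (7) with a biserial of $\sqrt{\rho}$. The function name is hypothetical.

```python
import math


def validity_all_items_at_half(n, rho):
    """Correlation of the number-right score with true score when all n
    items are passed by 50% and have tetrachoric intercorrelation rho."""
    z = 1.0 / math.sqrt(2.0 * math.pi)   # normal ordinate at the median (.399)
    p = 0.5
    # Covariance of one item with true score: p * M_p = r_bis * z, r_bis = sqrt(rho).
    item_cov = math.sqrt(rho) * z
    # Proportion passing both items of a pair (bivariate normal, cuts at the median).
    p_both = 0.25 + math.asin(rho) / (2.0 * math.pi)
    item_var = p * (1.0 - p)
    pair_cov = p_both - p * p
    score_var = n * item_var + n * (n - 1) * pair_cov
    return n * item_cov / math.sqrt(score_var)


for rho in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(rho, round(validity_all_items_at_half(153, rho), 3))
```

Under these assumptions the function should agree closely with the “All .5” column of Table 2, fall as the intercorrelation rises from .6 to .8 for the longer tests, and return about .798 for any n when the intercorrelation is unity.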

This seemingly paradoxical finding may be explained as follows. 
It is apparent first of all that with intercorrelations of unity the 
test can, at best, bisect the criterion distribution, and secondly the 
sum of continuous variables which intercorrelate to the same degree 
will correlate perfectly with the factor common to them (true score) 
as the number of variables approaches infinity, no matter how low 
the assumed intercorrelation value. This is evident from examina- 
tion of the formula for correlation of sums, which reduces to 


$$r_{(\Sigma X)t} = \frac{n \, r_{Xt}}{[\, n + n(n-1) \, r_{XX} \,]^{1/2}} \qquad (9)$$

when the intercorrelations are assumed to be constant and, as in the 
special case we are considering, the square root of the assumed intercorrelation 
is their correlation with true score. If we divide the 
numerator and denominator of Equation (9) by n to give 

$$r_{(\Sigma X)t} = \frac{r_{Xt}}{[\, 1/n + (1 - 1/n) \, r_{XX} \,]^{1/2}} \qquad (10)$$

or what might be called the correlation of averages, it is evident 
that, no matter how low the value of $r_{XX}$, $r_{(\Sigma X)t}$ can be made to approach 
unity by increasing n. Note that the standard error of estimate 
of the average must approach zero as $r_{(\Sigma X)t}$ approaches unity (as 
a function of the increase of n). The standard deviation of single 
continuous variable scores for different individuals having the same 
true scores or for the same individual but different item variables 
would be the standard error of estimate or $\sigma_X \sqrt{1 - r_{Xt}^2}$. When $r_{Xt}$ 





* This phenomenon is evident in Tucker’s results (8). 
is small and the number of variables is large, the distribution of 
scores obtained by a single individual would tend to be normal with 
an s.d. approaching unity but with a mean (and other parameters) 
that can be accurately predicted. If, now, we arbitrarily reduce the 
continuous variables to two-category items and allow a credit of one 
for all scores above a given value and consider his average score on 
a number of two-category items, it can be seen that this would be 
the proportion of scores in the distribution which exceeded whatever 
value had been selected to determine failure. Such scores would, of 
course, have a curvilinear relation to true score in that the number 
of items passed would be a portion of the area of a normal curve. In 
fact, if $r_{Xt}$ were very small, and the number of items in the test very 
large, $r_{(\Sigma X)t}$ would approach unity, and the mean (if expressed as a 
standard score) of the distributions of item-continuous variables for 
each individual would bear a one-to-one relation to the true score. 
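The approach of the correlation of averages to unity, however low the intercorrelation, can be seen by evaluating Equation (10) for increasing n. This is a small sketch for continuous variables only; the function name is an assumption of the illustration.

```python
import math


def correlation_of_sum_with_true_score(n, r_xx):
    """Equation (10): n continuous variables with common intercorrelation r_xx,
    each correlating sqrt(r_xx) with true score."""
    r_xt = math.sqrt(r_xx)
    return r_xt / math.sqrt(1.0 / n + (1.0 - 1.0 / n) * r_xx)


for n in (1, 9, 153, 10000):
    print(n, round(correlation_of_sum_with_true_score(n, 0.2), 4))
```

For a single variable the function returns the square root of the intercorrelation, and it rises steadily toward unity as n increases.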


3. Generalizations and Implications 
It will be attempted in the following paragraphs to give some 
indication of the results that would be obtained under conditions 
which do not satisfy the assumptions made in computing the entries 
of Tables 2 and 3. 


The true score correlations of Tables 2 and 3 can be easily converted 
to correlations with an external criterion as long as the correlation 
surface of the criterion and true score is normal and the 
tetrachoric intercorrelations of the items may be explained in terms 
of error and a single common factor. Examination of Equation (8) 
indicates that in determining the correlation with such an external 
criterion the denominator would remain unchanged. From our assumption 
that all biserial correlations between items and the external 
criterion are due to a single factor (true score), the correlation 
between the tests and the external criterion would be $r_{yt} \cdot r_{(\Sigma i)t}$, or 
the product of their correlations with t; that is, the product of their 
loadings in the factor common to them. Thus, if it is assumed that an 
external criterion correlates only .5 with true score (which would be a 
rough general average of those found in practice), all of the “tests” 
of Tables 2 and 3 would correlate with the external criterion to one-half 
the extent that they correlate with true score. This would, of 
course, mean that differences between test validities against this external 
criterion would also be one-half the size of the differences between 
the various true score correlations. Thus, from Table 2 an increase 
in number of items from 18 to 45 in a test with a rectilinear 
distribution of difficulties and assumed item-continuous variable intercorrelations 
of .2 increases the true score correlation .112, but would 
increase the correlation with the posited external criterion only .056. 

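The arithmetic of the example just given can be set out as follows. The sketch simply multiplies the two loadings on the common factor; the function and variable names are illustrative only, and the two validities are the Table 2 entries for the rectilinear distribution with item intercorrelations of .2.

```python
def validity_against_criterion(r_test_true, r_criterion_true):
    """Product of the two loadings on the single common factor (true score)."""
    return r_test_true * r_criterion_true


r_18, r_45 = 0.771, 0.883     # Table 2, rectilinear, item intercorrelation .2
r_criterion_true = 0.5        # assumed correlation of the criterion with true score

gain_true = r_45 - r_18
gain_criterion = (validity_against_criterion(r_45, r_criterion_true)
                  - validity_against_criterion(r_18, r_criterion_true))
print(round(gain_true, 3), round(gain_criterion, 3))
```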
The problem of the factor structure of the underlying item vari- 
ables is somewhat more complicated. While the results discussed 
cannot be generalized exactly to problems where more than one fac- 
tor is involved except with highly restrictive assumptions, there are 
certain useful statements that can be made with some confidence. 
Let us assume that a test is composed of two clusters of items with 
zero correlations between clusters, equal correlations (biserial) with 
true score, equal intercorrelations (tetrachoric) within clusters and 
with the same item difficulty distributions. The correlation of the 
sum of these two aggregates with true score for the complete test 
would equal their individual correlations with true scores* for that 
item cluster. While a more general formula could be readily derived 
by substituting in the correlation of sums formulas, it would seem 
inadvisable to spend too much time on this relatively unimportant 
issue. The general rule may be stated that the correlation with true 
score for a test involving uncorrelated clusters of items (wherein 
items are pure, have equal frequencies, intercorrelations, and diffi- 
culties) for any given determination varies against item difficulty 
distribution and item intercorrelation in the same manner as a sin- 
gle-factor test having as many items as each of the separate factors 
composing the larger test. While loading of items in factors other 
than those involved in true score, differences in weights, intercorre- 
lations, etc., introduce complexities into the problem which would 
be laborious to evaluate exactly for all cases, it is clear that such 
“lack of purity” will usually affect the functions under consideration 
in much the same manner as will a reduction in number of items. 
That is, the possibility of advantage by grouping items close to the 
fifty per cent correct value will become greater as the complexity of 
the factor structure of the variables involved becomes greater. Since, 
in general, item analyses are concerned with aggregates of items 
which probably involve a number of group factors, the upshot of the 
foregoing discussion is that in practical application of the results 
summarized in Table 2, the optimal distribution of item difficulties 
of a test X will tend in the direction of a single-factor test with some- 
what fewer than the actual number of items in test X. This point 
places greater emphasis on the desirability of items in the middle 
ranges of difficulty than the results of Table 2 would otherwise have 
suggested. 


* True score would, in this instance, be an unweighted sum of the true or 
common factor scores for the factors involved in each of the clusters. 
The rather small differences in validity obtained with variation 
of item difficulty distributions, tetrachoric item intercorrelation and 
number of items (as long at least as N is greater than 45) raise 
serious doubts concerning the value of total score item analysis procedures 
in general. If a homogeneous* group of items is subjected 
to usual item analysis procedures with the items with highest total 
score correlation retained, and, as a result, the average tetrachoric 
intercorrelation of the items is increased from .2 to .4 (or the average 
item true score biserial from $\sqrt{.2}$ to $\sqrt{.4}$), the effect will be 
to increase the true score correlation of the sum of 45 items from 
.928 to .957 when all items have .5 difficulty values, from .883 to .939 
with rectilinear distribution and from .905 to .950 with “normal” 
distributions of item difficulties. With a “true” score correlation 
of .5 with an external criterion, the increase in test validity would 
be less than .03. It must be remembered in this connection that, in 
actual practice, item validities are unreliable and, if items are selected 
such that a very considerable increase in mean item validity 
appears to occur, a second determination would find considerable regression 
toward the mean value for the original group of items. In 
addition, the mean item validity is not in any event very greatly influenced 
unless a considerable proportion of the items is discarded. 
For these reasons the effects as computed can be considered as maxi- 
mal. We have disregarded in these statements the possibility of skew 
in the item difficulty distributions. Since a heavily skewed distribu- 
tion would affect the validity considerably, this factor should be con- 
sidered before deciding in any practical situation that item analysis 
would not be worth while. However, if the mean score is close to half 
the number of items, it might often be safe to assume a symmetrical 
difficulty distribution. 

Total score item analysis procedures are possibly worth while 
only when it is attempted to cut the time and labor of testing and 
scoring to a bare minimum. The results here summarized seem to 
indicate that tests usually tend to be over-long—that is, they include 
many more items than are required for efficient selection—especially 
when the validity coefficients are in the thirties, forties, or fifties, 
and in the range of magnitude most usually found in practice. In 
such a situation, if the items are reasonably homogeneous, an increase 
in the number of items (above 20 or 30) or in the average validity 
of the items will increase negligibly the validity of the test as a 
whole. The entries in Table 2 bear directly on this point. We will, 
however, consider later some additional undesirable characteristics 
of total-score item analysis. 

* If the items are not homogeneous there is some doubt as to whether total 
score item analysis procedures should be employed at all. 

Most of the foregoing discussion of the optimal selection of item 
difficulty distribution ignores the effect of sampling error. If item 
difficulties were determined on 100 cases, those at the 50% point 
would have a standard deviation of $(PQ/N)^{1/2}$ or .05. A range of three 
sigmas on either side of the mean would be from .35 to .65. Since 
more than 100 cases are usually employed, and since the range and 
standard deviation are not large in relation to the total range in the 
distribution of Table 2 or distributions apt to be found in practice, 
sampling error very probably does not seriously distort obtained dif- 
ficulty values. 
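The sampling figures quoted above follow directly from the binomial standard error. A brief sketch, with an illustrative function name:

```python
import math


def difficulty_standard_error(p, n_cases):
    """Standard error of an observed proportion passing, (PQ/N)^(1/2)."""
    return math.sqrt(p * (1.0 - p) / n_cases)


se = difficulty_standard_error(0.5, 100)
low, high = 0.5 - 3 * se, 0.5 + 3 * se
print(round(se, 3), round(low, 2), round(high, 2))
```

Items farther from the 50% point, or difficulties based on more cases, give a smaller standard error than the .05 obtained here.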


TABLE 4 
Comparison of the Validities (Squared) of Table 2 with 
Corresponding Reliability Coefficients* 

Item Inter-                         Type of Distribution 
correlations     N      All .5       Rectilinear       Normal          Skew 
                       V²   KR#2      V²   KR#2      V²   KR#2      V²   KR#2 

    .2           9    566   569      426   435       -†    -†      360   407 
                18    720   721      594   606      651   659      512   578 
                45    861   869      780   793      819   829      685   774 
                90    924   930      869   885      893   906      772   873 
               153    950   957      910   929      927   943      815   921 

    .4           9    740   752      630   640       -†    -†      502   637 
                18    841   865      767   781      805   819      614   778 
                45    916   941      882   899      903   919      708   898 
                90    943   964      931   947      939   958      746   946 
               153    955   982      953   968      954   975      763   968 

    .6           9    803   862      753   766       -†    -†      550   781 
                18    865   926      857   868      870   904      618   877 
                45    903   969      925   942      929   957      667   947 
                90    918   984      953   970      950   978      685   973 
               153    924   990      964   982      958   987      693   984 

    .8           9    799   928      845   861       -†    -†      551   882 
                18    830   962      908   925      908   943      586   937 
                45    848   985      951   969      938   979      608   974 
                90    857   993      966   984      950   990      616   987 
               153    859   996      972   991      954   993      620   992 

* Decimals normally preceding each entry have been omitted. 
† See text for reason for excluding 9-item normal distribution. 
TABLE 5 
Comparison of the Validities (Squared) of Table 3 
with Corresponding Reliability Coefficients* 

                     Assumed Tetrachoric Intercorrelations 
                  .2             .4             .6             .8 
Difficulty   N   V²   KR#2      V²   KR#2      V²   KR#2      V²   KR#2 

Percentage correct 31 or 69; baseline or S.D. values +.5 or -.5 
             9   534   550     699   749     753   855     743   925 
            18   689   710     799   857     814   922     773   961 
            45   834   860     874   93?     852   967     790   984 
            90   897   924     903   968     867   983     797   992 
           153   925   954     916   981     872   990     799   995 

Percentage correct 16 or 84; baseline or S.D. values +1.0 or -1.0 
             9   445   491     585   709     618   832     591   914 
            18   608   659     684   829     674   908     618   955 
            45   750   828     762   924     714   961     634   982 
            90   821   906     792   960     728   980     640   991 
           153   854   943     806   976     733   988     643   995 

Percentage correct 7 or 93; baseline or S.D. values +1.5 or -1.5 
             9   315   391     423   635     433   789     396   895 
            18   454   562     517   776     484   882     417   945 
            45   616   762     598   897     521   949     432   977 
            90   699   865     629   946     534   974     437   988 
           153   740   916     643   967     540   985     438   993 

Percentage correct 2 or 98; baseline or S.D. values +2.0 or -2.0 
             9   181   264     253   521     254   722     217   867 
            18   286   417     333   686     295   839     232   929 
            45   440   642     411   845     326   929     243   970 
            90   536   782     445   916     339   968     247   985 
           153   588   859     461   949     343   978     248   991 

* Decimals normally preceding each entry have been omitted. 


4. A Comparison of the Effect of Variation in Item Difficulty 
Distribution on “Validity” and Reliability 
Kuder-Richardson Case 2 reliabilities (3) were determined for 
all “tests” of Tables 2 and 3 and are presented together with the 
squares* of the corresponding “true score” correlations as Tables 4 
and 5. Before discussing Tables 4 and 5, it will be helpful to clarify 
our definition of the term “reliability.” We will use the term as 
Kuder and Richardson have defined it; that is, as the correlation 
between two tests exactly comparable to each other, item by item. 
The reader is reminded here that it was assumed in computing the 
coefficients of Table 2 (and the corresponding K-R Case 2 reliabilities) 
that the “tests” have constituent items whose tetrachoric intercorrelations 
may be accounted for by a single factor with equal loadings 
of all variables. 

* The reliability is the upper limit of the square of the validity. This is 
equivalent to the more usual statement that the upper limit of the validity is 
the square root of the reliability. 

Now the square of the validity as we have defined it differs from 
the reliability as defined by Kuder and Richardson and as computed 
in K-R Case 2 (there are no approximations in the K-R Case 2 formula) 
only in that the validity is the correlation of the given test 
with a continuous normally distributed and perfect measure of the 
single attribute measured by the test, while the reliability is the cor- 
relation with an equivalent test which exactly duplicates any ten- 
dency of the given test toward error of measurement or concentra- 
tion of efficiency of discrimination at some point or range on the dif- 
ficulty scale. In examining the differences between the validities and 
reliabilities of Tables 4 and 5, then, we are concerned with the differ- 
ential effect on reliability and “validity” of variation in the factor 
pattern of the product-moment intercorrelations or, in other words, 
the effect of the difficulty factor(s) introduced as the variables are 
reduced to two-category items. 

The variation is slight in the case of tests with few items all at 
50% or with the tests whose items form a skewed difficulty distribu- 
tion and quite appreciable with larger numbers of items and recti- 
linear or normal item difficulty distributions. It is largest in the case 
of tests with items concentrated at a particular point considerably 
divergent from 50%. Note the drop in validity that has been dis- 
cussed at some length in connection with Tables 2 and 3 and the rise 
in the reliabilities corresponding to these validities. This differential 
effect on reliability and validity apparently increases as the dispersion 
of the item difficulty distribution increases and as the mean difficulty 
value diverges from 50%. The effect of increase in the number of 
items is very nearly the same on both reliability and validity. There 
seems possibly some slight tendency to obtain greater increase in re- 
liability than in validity. This tendency becomes appreciable, how- 
ever, only in the case of skewed item difficulty distributions with 
high item intercorrelations or in the tests with items concentrated 
at very high or very low difficulty values. 

In a recent article (1) Gulliksen demonstrated that the reliabil- 
ity of a test increases: 

a. As the average item intercorrelation of the test increases. 

b. As the dispersion of the item difficulties decreases. 

c. As the mean item difficulty approaches 50% correct. 

It can be seen that variation in the reliabilities in Tables 4 and 5 is 
in agreement with Gulliksen’s three theorems. 

Neither of the first two of Gulliksen’s propositions holds generally 
in describing the behavior of validities even with the rather 
restrictive definition of validity employed in the present study. While 
it might be argued that proposition (a) would hold under most 
conditions, it should be noted that even though decreases in validity 
with increase in item intercorrelation do not occur for most cases, 
the rate of increase is almost negligible in a considerable number of 
instances. This is particularly true when the number of items is 
large and the dispersion of item difficulties is low. 

It would appear from the foregoing results that it is desirable 
that test constructors realize that reliability as ordinarily computed 
gives the correlation of a test with a second form equivalent to it, 
not only in content but also in difficulty level. The writer has been 
of the opinion that reliability in and of itself is not necessarily de- 
sirable and that evaluation of item selection procedures and princi- 
ples which aim at or obtain increases in reliability with no considera- 
tion of effect on validity may often result in selection of procedures 
which not only effect no increase in validity but actually tend to pro- 
duce the opposite result. An obvious example of such a possibility 
would be the narrowing of the content of the test so that the various 
items become, in the extreme position, identical. Very high reliability 
could be achieved in this way but it would probably be at the expense 
of validity. In the same way a narrowing of the difficulty range of 
the items has been shown to increase the reliability considerably but 
actually to decrease the validity. 

In addition to the points previously mentioned regarding total 
score item analysis, we might add that any concentration of the com- 
posite items of a total score within a given difficulty range will tend 
to bias the correlations of items with total score in favor of those 
items whose difficulty values fall in that range. The final selection, 
if based purely on best prediction of total score, will tend to show an 
exaggeration of the difficulty bias in the original total score composite. 
A FIRST-ORDER METHOD FOR ESTIMATING 
CORRELATION COEFFICIENTS 


JOHN A. WEICHELT 


A rapid method of estimating a correlation coefficient is given. 
The method expresses the correlation coefficient as the ratio between 
two differences in sums (or means) of the dependent variable computed 
only for extremes of the bivariate distribution. A trial shows 
that this method gives results similar to the product-moment correlation 
coefficient. Extensions of the method to qualitative data are 
also suggested. 


1. Introduction 
A rapid and simple new method has been devised which expresses 
correlation as the ratio between two differences in the sums or means 
of the extremes of the dependent variable. It has the following characteristics 
and advantages: 

1. The calculations are fast and simple. 

2. The relationship between two factors is expressed so directly in 
terms of common practical applications that the meaning of correlation 
is clarified for the layman. This was the primary purpose 
for setting up a new formula. 

3. The units of measurement subtract and cancel so that there is 
no need to calculate or estimate variance or the sample mean. 

4. Application of the method is not limited by assumptions as to 
the type of distribution. The only important assumption is that 
the correlation is approximately linear; otherwise the formula 
has a meaning only in a limited range. 

5. The population value of this ratio is the same as the product-moment 
coefficient for a normal bivariate distribution. 

6. It can be applied where the independent variable is qualitative. 
The independent variable is used only for sorting the groups that 
give the means of the dependent variable which appear in the 
numerator of the ratio. 

7. Under certain assumptions and limitations it can be applied to 
cases where both variables are qualitative. 


2. Description of Method 
1. Subtract the sum of the smallest Y values from the sum of the 
largest Y values. Take the same number of large values as small 
values, approximately 20% each of the total number of cases 
in the sample. 

2. Subtract the sum of the Y values which are associated with the 
smallest X values from the sum of the Y values which are associated 
with the largest X values. Take the same number of large 
and small cases as in (1). 

3. The ratio of (2) to (1) is the correlation coefficient. 



The following paragraph illustrates this with an example. A more 
precise and a more general definition is presented in the section on 
Mathematical Description. 
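The three steps can also be written out as a short routine. This is a sketch under stated assumptions rather than a definitive implementation: the function name and the `fraction` parameter are hypothetical, tied values are broken simply by their order in the input, and no provision is made for a sample in which the extreme Y sums are equal (which would give a zero denominator).

```python
def extreme_group_correlation(x, y, fraction=0.2):
    """Estimate the correlation of y with x from the extremes of the sample.

    Step 1: sum of the k largest y minus sum of the k smallest y.
    Step 2: sum of y for the k largest x minus sum of y for the k smallest x.
    Step 3: the ratio of step 2 to step 1.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same number of cases")
    k = max(1, round(n * fraction))                 # about 20% of the cases in each tail
    y_sorted = sorted(y)
    denominator = sum(y_sorted[-k:]) - sum(y_sorted[:k])
    order = sorted(range(n), key=lambda i: x[i])    # case indices ranked by x
    numerator = sum(y[i] for i in order[-k:]) - sum(y[i] for i in order[:k])
    return numerator / denominator


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3, 1, 2, 6, 4, 5, 9, 7, 8, 10]
print(extreme_group_correlation(x, y))
```

Only sorting and addition are required, which is what makes the method suitable for mechanical tabulation.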


TABLE 1 

          1.      2.      3.      4.      5.      6. 

       34787   30490   29296   30585   32068   31919 * 
1.     24249   23805   23584   26622   28010   27404 * 
       10538    6685    5762    3963    4058    4515 † 
        1.00     .63     .51     .36     .34     .36 † 

       32938   32245   29588   30965   31011   31798 * 
2.     26054   21601   23588   26873   28008   27831 * 
        6884   10644    5945    4092    3003    3967 † 
         .65    1.00     .53     .38     .25     .32 † 

       32226   29867   31984   31077   31085   31383 * 
3.     26709   24575   20751   26269   28070   28131 * 
        5517    5292   11233    4808    3015    3252 † 
         .52     .50    1.00     .44     .25     .26 † 

       31231   28929   28994   33793   32091   31978 * 
4.     27238   24988   24213   22914   28044   27424 * 
        3993    3946    4781   10879    4047    4554 † 
         .38     .37     .43    1.00     .34     .36 † 

       31108   28331   27901   30697   35058   34030 * 
5.     27704   25638   25218   26976   23048   25831 * 
        3426    2693    2683    3721   12010    8199 † 
         .33     .25     .24     .34    1.00     .65 † 

       31255   28660   27969   30554   34078   35863 * 
6.     27321   25263   24987   26846   26076   23322 * 
        3934    3397    3032    3708    8002   12541 † 
         .37     .32     .27     .34     .67    1.00 † 

* Lines printed by tabulator. 
† Lines written by hand. 


3. Results 


Table 1 is a sample work sheet from an actual job in which the 
six variables are selection test scores for 2560 students. 


Line 1: Numbers 1 to 6 indicate the selection factors for which total 
scores are listed. All totals for a given factor are in the 
same column. 

Line 2: Total scores for the 512 students ranking highest on the 
first score. (Printed by tabulator.) 

Line 3: Total scores for the 512 students ranking lowest on the first 
score. (Printed by tabulator.) 

Line 4: Differences: Line 2 less line 3. (Written by hand.) 

Line 5: Correlation coefficient. Ratio of the difference in line 4 to 
the largest difference in the same column. (Written by hand.) 

Line 6: Total scores for the 512 students ranking highest on the 
second score. (Printed by tabulator.) 

Etc. The ranking, sorting, and adding process is repeated for each 
variable. 


There was a difference of 10,538 between the best and poorest 
scores on factor 1. The difference in scores on factor 1 was 6884 
when the groups were selected according to rank on factor 2. This 
was 65% as much as the difference would have been if all factors had 
been taken into consideration. This ratio is the correlation coefficient. 

This job was performed with punched card equipment. Two- 
digit scores had already been punched on cards for record purposes 
and the same cards were used to tabulate the total scores without any 
conversion process such as might have been needed to get progressive 
digiting for product-moment calculations. 

The mechanical sorter was used to select 12 groups of cards by 
six ranking processes. The printed figures are totals added and listed 
mechanically by the tabulator. The other figures were entered by 
hand. Less than 34 hours were required to complete the entire job 
of calculating all the intercorrelations of six variables for 2560 cases. 


4. Reliability 

Two independent calculations are developed for each correlation 
pair. The average difference between the pairs in the example is 
.012. This might be taken as an indication of reliability although the 
population values are theoretically identical only under certain as- 
sumptions. Under these assumptions it seems reasonable to take an 
average of the two calculated coefficients. 

The second line of each pair of figures in Table 2 shows the
correlation coefficient calculated by the product-moment formula for the
same data. The average difference between the calculations by the
two methods is .011. This difference is reduced to .008 if the
product-moment value is compared with the average of the two independent
calculations by the other method. It is important to state here that
the data used in this example departed significantly from the normal
type (skewed to the right) and therefore the two methods theoretically
are not expected to give identical results. A smaller average
difference would be expected if the distributions were normal. (See
the section on Mathematical Description.)


TABLE 2

          1      2      3      4      5      6

   1            .63    .51    .36    .34    .36
                .64    .50    .37    .30    .35

   2     .65           .53    .38    .25    .32
         .64           .51    .37    .25    .31

   3     .52    .50           .44    .25    .26
         .50    .51           .43    .24    .26

   4     .38    .37    .43           .34    .36
         .37    .37    .43           .32    .35

   5     .33    .25    .24    .34           .65
         .30    .25    .24    .32           .65

   6     .37    .32    .27    .34    .67
         .35    .31    .26    .35    .65
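The three averages quoted in this section can be rechecked from the two-digit entries of Table 2. The sketch below is an illustration, not part of the original work sheet: the dictionaries hold the coefficients in hundredths as read from Tables 1 and 2, and the variable names are chosen here.

```python
# Coefficients in hundredths, keyed by (row, column) as read from Table 2.
method = {
    (1, 2): 63, (1, 3): 51, (1, 4): 36, (1, 5): 34, (1, 6): 36,
    (2, 1): 65, (2, 3): 53, (2, 4): 38, (2, 5): 25, (2, 6): 32,
    (3, 1): 52, (3, 2): 50, (3, 4): 44, (3, 5): 25, (3, 6): 26,
    (4, 1): 38, (4, 2): 37, (4, 3): 43, (4, 5): 34, (4, 6): 36,
    (5, 1): 33, (5, 2): 25, (5, 3): 24, (5, 4): 34, (5, 6): 65,
    (6, 1): 37, (6, 2): 32, (6, 3): 27, (6, 4): 34, (6, 5): 67,
}
# Product-moment values are symmetric, so each pair is stored once.
product_moment = {
    (1, 2): 64, (1, 3): 50, (1, 4): 37, (1, 5): 30, (1, 6): 35,
    (2, 3): 51, (2, 4): 37, (2, 5): 25, (2, 6): 31,
    (3, 4): 43, (3, 5): 24, (3, 6): 26,
    (4, 5): 32, (4, 6): 35,
    (5, 6): 65,
}
pairs = sorted(product_moment)

# Mean gap between the two independent calculations of each pair.
between = sum(abs(method[i, j] - method[j, i]) for i, j in pairs) / len(pairs) / 100

# Mean gap between each single calculation and the product-moment value.
single = sum(
    abs(method[i, j] - product_moment[i, j]) + abs(method[j, i] - product_moment[i, j])
    for i, j in pairs
) / (2 * len(pairs)) / 100

# Mean gap when the two calculations are averaged first.
averaged = sum(
    abs((method[i, j] + method[j, i]) / 2 - product_moment[i, j]) for i, j in pairs
) / len(pairs) / 100

print(round(between, 3), round(single, 3), round(averaged, 4))
```

The first two figures agree with the .012 and .011 reported. The third comes out a little above the .008 quoted, which may reflect working figures carried to more than two places.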


5. One Variable Qualitative 

In a large training school the question was raised as to whether 
it would be better to select trainees by personal interview or on 
the basis of selection test scores. A total of 800 trainees who com- 
pleted the course of instruction had been given selection tests and 
interviewed before entering school. The interviewers, who were con- 
sidered experts, had classified these prospective trainees as excellent, 
good, or satisfactory for the course of instruction. 

The school performance grades were taken as the criterion to 
determine whether the personal interview method was more efficient 
than the selection test scores for screening purposes. The three 
classes of trainees were compared on the scores they received for 
performance. Then they were ranked according to selection test 
scores and again compared on their school performance. Finally they 
were ranked on the performance scores themselves to find the maxi- 
mum variation in scores in the kind of grouping resulting from the 
interviewer classification. The results were as shown in Table 3. 

TABLE 3

                                                   Average
                                                   School        Corre-
                                     Number of     Performance   lation
                                     Students      Score         Index

As appraised by interviewers:
    Excellent                           136           52.8
    Good                                502
    Satisfactory                        162           50.2
    Difference                                          2.6        .04

Ranked on selection test scores:
    Best 136                            136           58.6
    Middle group                        502
    Poorest 162                         162           42.4
    Difference                                         16.2        .23

Ranked on school performance scores:
    Best 136                            136           86.7
    Middle group                        502
    Poorest 162                         162           16.4
    Difference                                         70.3       1.00


The coefficients are derived here in the same way as in the
previous example except that averages are used instead of total scores.
This is necessary if the number of cases in the high bracket is not
the same as the number in the low bracket. These ratios probably
do not possess all the properties of correlation coefficients, but they
nevertheless indicate the relative importance of the two factors in
predicting school performance.
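The two indices in Table 3 follow from the group averages alone. A minimal sketch, in which the function name `correlation_index` and its argument names are chosen here for illustration:

```python
def correlation_index(best_mean, poorest_mean, max_best_mean, max_poorest_mean):
    """Difference of group averages, relative to the largest attainable difference."""
    return (best_mean - poorest_mean) / (max_best_mean - max_poorest_mean)


# Largest attainable difference: groups formed on the performance scores themselves.
interview = correlation_index(52.8, 50.2, 86.7, 16.4)
selection_tests = correlation_index(58.6, 42.4, 86.7, 16.4)
print(round(interview, 2), round(selection_tests, 2))
```

Averages rather than totals are used because the high group (136 cases) and the low group (162 cases) are of unequal size.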


6. Both Variables Qualitative 
Suppose, for example, a number of salesmen have been ranked 
in order of their performance or sales ability. How important is 
marital status as a factor to be considered when selecting new sales- 
men? The following statistical analysis gives a useful answer to this 
question. 


1. Find what percentage of the salesmen are married. Suppose this 
turns out to be 10%. 

2. Find the number of married men who are in the top 10% bracket 
as salesmen. Express as the per cent of all married men. 

3. Find the number of married men who are in the bottom 10% 
bracket as salesmen. Express as per cent. 

4. Subtract (3) from (2). This represents the association of mari- 
tal status with selling ability. 
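Steps 1 to 4 can be sketched as follows. The function name and the input layout (one boolean per salesman, ordered from best to poorest) are assumptions made for the illustration. Because the bracket size is fixed by the share holding the attribute, each bracket contains as many men as there are married men in the whole group.

```python
def attribute_association(has_attribute):
    """has_attribute: booleans for the salesmen, ordered from best to poorest."""
    n = len(has_attribute)
    holders = sum(has_attribute)
    if holders == 0 or holders == n:
        raise ValueError("the attribute must divide the group")

    # Step 1: the share p = holders / n fixes the bracket size, p * n = holders.
    bracket = holders

    # Steps 2 and 3: holders in the top and bottom brackets,
    # each expressed as a fraction of all holders.
    top = sum(has_attribute[:bracket]) / holders
    bottom = sum(has_attribute[-bracket:]) / holders

    # Step 4: subtract (3) from (2).
    return top - bottom


ranked = [True, True] + [False] * 18   # both married men rank at the top
print(attribute_association(ranked))
```

When more than half the group holds the attribute the two brackets overlap; the original text does not say how that case should be treated.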


The interpretation of this coefficient departs somewhat from that
of the correlation coefficient. The coefficient simply indicates the
importance of a factor relative to all factors affecting the measure of
performance.

The method for this special case was derived from the preceding 
case by analogy, and requires further study. 


7. Mathematical Description 

Let H and L be arbitrary fractions of the total number N of
observations of a random sample. Sort out the NH largest Y values
and identify these as $Y_H$. Similarly let $Y_L$ denote the NL smallest Y
values. Let $Y|X_H$ and $Y|X_L$ denote all the Y values which are
associated with the NH largest and the NL smallest X values, respectively.

The sample correlation coefficient R and the population
coefficient $R_0$ are defined as follows:

\[ R = \frac{M(Y|X_H) - M(Y|X_L)}{M(Y_H) - M(Y_L)}; \qquad R_0 = \frac{E(Y|X_H) - E(Y|X_L)}{E(Y_H) - E(Y_L)}; \]

\[ M(Y_H) = \frac{1}{NH} \sum Y_H; \qquad E(Y_H) = \frac{1}{H} \int_{h_y}^{\infty} Y \, P(Y) \, dY; \]

\[ M(Y|X_H) = \frac{1}{NH} \sum Y|X_H; \qquad E(Y|X_H) = \frac{1}{H} \int_{-\infty}^{\infty} \int_{h_x}^{\infty} Y \, p(X,Y) \, dX \, dY; \]

\[ H = \int_{h_x}^{\infty} p(X) \, dX = \int_{h_y}^{\infty} P(Y) \, dY. \]

In the equations on the right p(X,Y), p(X), P(Y) are distribution
functions, but p(X) is not necessarily the same function of X as
P(Y) is of Y. The points $h_x$ and $h_y$ are defined by the arbitrarily
chosen H.

If H = L,

\[ R = \frac{\sum Y|X_H - \sum Y|X_L}{\sum Y_H - \sum Y_L}. \]


Let $\rho$ denote the population correlation coefficient defined by the
product-moment formula which minimizes the mean square of the error
e in the following equation. Then:

\[ (Y - \bar{Y}) + e = \rho \, \frac{\sigma_y}{\sigma_x} \, (X - \bar{X}); \]

\[ R = \rho \, \frac{\sigma_y}{\sigma_x} \, \frac{M(X_H) - M(X_L)}{M(Y_H) - M(Y_L)} - \frac{M(e_H) - M(e_L)}{M(Y_H) - M(Y_L)}; \]

\[ R_0 = \rho \, \frac{\sigma_y}{\sigma_x} \, \frac{E(X_H) - E(X_L)}{E(Y_H) - E(Y_L)} - \frac{E(e_H) - E(e_L)}{E(Y_H) - E(Y_L)}. \]

Here $M(e_H)$ is $M(e|X_H)$ and represents the mean deviation of the
actual Y values from the line of regression defined by $\rho$ in the
interval for which $X_H$ is defined.


The following conclusions are obvious from the last equation.

$R_{yx} R_{xy} = \rho^2$, if $E(e_H) = E(e_L)$, where $R_{yx}$ and $R_{xy}$ represent
population values.

$|R_0| > |\rho|$, if the correlation is greater at the ends than in the
middle and X and Y have the same distribution
function.

$|R_0| < |\rho|$, if the correlation is greater in the middle than at
the ends and X and Y have the same distribution
function.

$R_0 = \rho$, if X and Y have the same and symmetrical
distribution functions (as in the normal case).
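The first of these conclusions can be checked in a few lines. The following is a sketch under stated assumptions, not the author's own derivation: the shorthand $\Delta_X$ and $\Delta_Y$ is introduced here, and the two population coefficients are read as "Y tails selected on X" and "X tails selected on Y", with the error terms cancelling in both directions.

```latex
% Shorthand introduced for this sketch (not in the original):
%   \Delta_X = E(X_H) - E(X_L),   \Delta_Y = E(Y_H) - E(Y_L).
% Assumed reading: R_{yx} selects on X, R_{xy} selects on Y,
% and E(e_H) = E(e_L) holds in both directions.
\begin{align*}
R_{yx} &= \rho \, \frac{\sigma_y}{\sigma_x} \, \frac{\Delta_X}{\Delta_Y}, \\
R_{xy} &= \rho \, \frac{\sigma_x}{\sigma_y} \, \frac{\Delta_Y}{\Delta_X}, \\
R_{yx} \, R_{xy} &= \rho^2 \cdot \frac{\sigma_y}{\sigma_x} \frac{\sigma_x}{\sigma_y}
                    \cdot \frac{\Delta_X}{\Delta_Y} \frac{\Delta_Y}{\Delta_X} = \rho^2 .
\end{align*}
```

If in addition $\Delta_X / \sigma_x = \Delta_Y / \sigma_y$, which holds when X and Y follow one distribution function apart from location and scale, each coefficient separately reduces to $\rho$.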


8. Summary 

It is believed that a very rapid and efficient method for estimat- 
ing the correlation coefficient has been presented here. Empirical 
evidence indicates a high level of reliability. It is also adaptable to 
certain special cases where other methods cannot be applied directly. 

Most important, however, is the belief that this method presents 
the data in terms which are commonly observed in practical experi- 
ence, so that it brings the technical and non-technical analysts more 
nearly to a common level of understanding.


A PREPUNCHED MASTER DECK FOR THE COMPUTATION OF
SQUARE ROOTS ON IBM ELECTRICAL
ACCOUNTING EQUIPMENT


WM. A. REYNOLDS*
NATIONAL BROADCASTING COMPANY 


This paper presents a prepunched deck of cards to enable the 
extraction of square roots on standard punch card tabulating equip- 
ment. Such a deck is valuable in constructing mathematical tables 
which involve square roots or in obtaining standard deviations in 
connection with computing correlation coefficients. By using a deck 
of reciprocals in conjunction with the deck for square roots, corre- 
lations may be solved completely on IBM equipment. 


1. Introduction 

The value of prepunched decks of cards for obtaining statistics 
such as percentile ranks, percentages, and product-moments has been 
recognized for some time. A review of the literature, however, does 
not reveal that the solution of square roots has been made available. 
It is the purpose of this paper to present a prepunched deck of cards 
to enable the extraction of square roots on standard punch card tabu- 
lating equipment. 

Such a deck would have value, for example, in the construction 
of mathematical tables in which a root appears, or in turning out 
masses of standard deviations to be used for the solution of corre- 
lation coefficients. In conjunction with a deck of reciprocals, the 
correlation formula may be solved completely on IBM equipment, 
without the necessity heretofore experienced of having to resort to 
hand-operated calculating devices. 


2. Derivation of Formula 
To secure the square root of x, where x is a number of over
three digits:

Let x = N + K, N being the first three digits followed by as
many zeros as digits in K,† and K being the following digits.


* The author wishes to acknowledge the help given on machine procedures
by Wallace M. Taylor, 1st Lt. A.C., Wright Field, Ohio.

† The expression is fully stated by x = 10^y N + K, where y is the number
of digits in K. For simplicity, the 10^y has been dropped in the following pages.


By the binomial expansion,

\[ \sqrt{x} = \sqrt{N + K} = (N + K)^{1/2} = N^{1/2} + \tfrac{1}{2} N^{-1/2} K - \tfrac{1}{8} N^{-3/2} K^2 + \tfrac{1}{16} N^{-5/2} K^3 - \cdots \tag{1} \]

Combining the first two terms on the right,

\[ \sqrt{x} = (2N + K) \left( \frac{1}{2\sqrt{N}} \right) - \tfrac{1}{8} N^{-3/2} K^2 + \tfrac{1}{16} N^{-5/2} K^3 - \cdots \tag{2} \]

Dropping terms to the right, a close approximation is obtained.

\[ \sqrt{x} \doteq (2N + K) \left( \frac{1}{2\sqrt{N}} \right) \tag{3} \]

The value of 1/(2√N) is a constant. The multiplication of the
two factors results in a close approximation to √x.

Example: Find √.251568 where N = .251000 and K = .000568.
By Formula (3),

\[ \sqrt{.251568} \doteq (.502000 + .000568) \left( \frac{1}{2\sqrt{.251}} \right) = (.502568)(.99800598) = .50156587 . \]
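Formula (3) and the worked example translate directly into code. The sketch below uses ordinary floating-point arithmetic rather than the eight-place machine arithmetic of the paper, and the function names are chosen here for illustration.

```python
import math


def multiplying_factor(n):
    """The constant 1/(2*sqrt(N)) that belongs to one value of N."""
    return 1.0 / (2.0 * math.sqrt(n))


def approximate_root(n, k):
    """Formula (3): (2N + K) times the multiplying factor."""
    return (2.0 * n + k) * multiplying_factor(n)


factor = multiplying_factor(0.251)
approx = approximate_root(0.251, 0.000568)
print(f"{factor:.8f}")                  # the factor for N = .251
print(f"{approx:.8f}")                  # approximate root of .251568
print(f"{math.sqrt(0.251568):.8f}")     # library square root, for comparison
```

The approximate root exceeds the library value in the seventh decimal place, which is the error the later sections set out to bound and remove.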


3. Description of the Prepunched Deck 
The Master Deck consists of 1800 cards punched successively
with values for N of three digits (Exhibit 1).* Also in each card
are punched the values of 1/(2√N), √N† and 2N.


* The 1800 cards are made up of two decks of 900 each; the first 900 are for
numbers 100 to 999 and the second 900 are for the same three digits followed by
a cipher, i.e., 1000 to 9990. This is necessary, as the number of digits in a root
is determined by the number of couples around the decimal in the square.

† It is obvious that if it is desired to find the square root to seven significant
figures of a three-digit number, its value may be determined directly by merely
selecting out the proper card and printing the card on an IBM Tabulator.


The multiplying factor, 1/(2√N), is carried to eight places. From
Exhibit 1 it is noted that as the value of N increases it repeats itself
in logarithmic cycles. Also it should be noted that the factor 1/(2√N)
shifts one place to the right between the even-digit numbers 2500
and 2510. In order that an eight-digit number may be most efficiently
handled by the IBM Multiplier, the N’s have been rearranged in
Exhibit 1 to run from .251 to 25.0, with the multiplying factor varying
from .99800598 to .10000000.*

In handling practical problems where it is desired to obtain roots 
from numbers which vary greatly in magnitude, it has been found 
desirable to make several decks in such a manner that all N’s and 
factors are punched around a fixed decimal point in the cards. This 
greatly facilitates matching the N’s of the Master Deck with the
N + K values in the detail cards.†

Four decks have been prepared, and the ranges, with decimal 
points indicated, are shown in Exhibit 2. As indicated in this table, 
Deck 2 includes all numbers from .251 to 25.0. Decks 1 to 4 provide a 
range from N = .00251 to N = 250,000. Exhibit 3 shows a card
form which is designed so that the values of N, 1/(2√N), 2N and √N
of the four decks may be punched around fixed decimal positions in
Fields B, C, D and G.‡ Field A, Column 1, is used for identifying
the deck.


Computation of the Multiplying Factors 

Computation of the Multiplying Factors was made by finding 
the root of N to nine places, doubling, and finding the reciprocal. 
The factors were carried to eight places and their accuracy was 
checked by multiplying them by 2N and comparing the resulting roots 
(to seven places) with Burington’s tables.§ In 45 cases the roots 
were off one in the last digit, but were corrected in the listing (Ex- 
hibit 1) and denoted by an asterisk. The multiplying factors were 
left unadjusted because they are the most accurate factors to use 
for interpolating roots. || 
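The construction of one deck, and the check of the factors against the roots, can be sketched as below. The layout is an assumption inferred from the stated range of Deck 2 (.251 to 25.0 in 1800 three-digit values of N); the function name, the dictionary keys, and the use of floating point in place of nine-place hand computation are illustrative choices.

```python
import math


def build_deck_2():
    """Assumed layout of Deck 2: 1800 three-digit values of N from .251 to 25.0."""
    stretches = (
        (251, 999, 1000),   # N = .251 to .999
        (100, 999, 100),    # N = 1.00 to 9.99
        (100, 250, 10),     # N = 10.0 to 25.0
    )
    cards = []
    for first, last, divisor in stretches:
        for digits in range(first, last + 1):
            n = digits / divisor
            root = math.sqrt(n)
            cards.append({
                "N": n,
                "factor": round(1.0 / (2.0 * root), 8),  # carried to eight places
                "two_n": round(2.0 * n, 3),
                "root": root,
            })
    return cards


deck = build_deck_2()
print(len(deck), deck[0]["factor"], deck[-1]["factor"])

# Check described in the text: the factor times 2N should reproduce the root.
worst = max(abs(card["factor"] * card["two_n"] - card["root"]) for card in deck)
print(worst < 5e-7)
```

Rounding the factor to eight places can move the product by a few units in the eighth figure, which is consistent with the occasional one-digit discrepancies in the seventh place that the text reports.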


* Note that this arrangement is better than having the deck run from 1.00
to 99.9. The Multiplying Factor for 1.00 is .50000000, while that for 99.9 is
.050025019. The factor shifts one place to the right, making necessary a
rewiring of the Multiplier plugboard if eight digits are desired in the factor.

† The assumption is made that all numbers in the detail cards are around a
fixed decimal position.

‡ In many practical problems the numbers do not have as great a range as
indicated by the following discussion. If the range of numbers is greater than
can be handled by one deck, it will often be necessary to divide the range and
handle each section separately with a master deck. At the final stage, however,
the roots determined by the two decks may be reproduced into a single card form
around a fixed decimal position.

§ Burington, Richard S. Handbook of mathematical tables and formulas.
Sandusky, Ohio: Handbook Publishers, 1940.

|| This situation obtains only when the true root is almost exactly half-way
between two numbers in the seventh place.


4. Extraction of Square Roots


General Procedure for Obtaining Roots

Step 1. The square root master file and a deck of detail cards
containing the numbers, N + K, for which roots are to be extracted,
are merged by collator so that each N on the master card is followed
by the detail card having the first three corresponding digits of N.

Step 2. The merged cards are then put through a reproducing 
gang-punch and the multiplying factor punched from the master 
cards into the detail cards. At the same time the value of 2N is re- 
produced in the detail cards around a decimal point that corresponds 
to the decimal point of the number whose root is to be extracted. 

Step 3. The master cards are sorted out and returned to the file, 
and the detail cards are ready to be run through the multiplier. 

Step 4. The Multiplier Plugboard is wired as follows: The 2N
and the K are wired from their respective fields into the “multiplicand”
sections of the plugboard (Exhibit 5). The factor 1/(2√N) is
then wired into the “multiplier” section of the plugboard. A typical
detail card is shown in Exhibit 4.

Step 5. One run through the multiplier is necessary to multiply
2N + K by the factor 1/(2√N). The product is the square root of
N + K and is punched into the card.


Steps in IBM procedure. The following steps are followed 
for computing the value of the square root of .251568. It is assumed 
that the detail cards have been sorted in ascending order by N . 

Step 1. Collate and merge the master card N = .251 in front of 
the detail card N + K = .251568. (See Exhibits 3 and 4.) 

Step 2. Gangpunch the multiplier 1/(2√N) = .99800598 and the
2N = .502 (fields C and D, master card) into fields C and D of the
detail card. (See Exhibit 4.)

Step 3. Sort (or collate) out the master cards and return to the 
Master File. 

Step 4. Wire multiplier board to read 2N + K, which is equal to
.502568, in the multiplicand hubs. This is accomplished by wiring
the 2N = .502 from field D to the left of the multiplier hubs containing
K = .000568 from field B. (Exhibit 5 gives an example of the
plugboard wiring.) The multiplying factor, 1/(2√N), field C on
Exhibit 4, is wired into the multiplier hubs, as shown in Exhibit 5.*

* This is the most difficult step to handle correctly. Note that 2N may be
either a three- or four-digit number, e.g., N = .251, 2N = .502; N = .499,
2N = .998; but if N = .500, 2N = 1.000; N = .999, 2N = 1.998. Thus, to wire
the 2N field four hubs must be wired, and they must be shifted one place to the
left at the break between N = .999 (2N = 1.998) and N = 1.0 (2N = 02.00).

Step 5. Run cards through the multiplying punch with plugboard
as shown. Thus the product, automatically punched by the multiplying
punch in field H of the detail card (Exhibit 4), is the result of
the multiplication of 2N + K by 1/(2√N). In the example shown,
.502568 × .99800598 = .50156587, the approximate square root of
.251568.

Step 6. The accuracy of the approximate root may be checked
by multiplying it by itself and punching the product into field N, or
a correction formula, described later, may be applied. Checking may
be accomplished in several ways, such as printing x and (√x)² on a
tabulator sheet and sight-verifying, matching the fields on a collator,
etc.
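The machine procedure can be imitated with exact decimal arithmetic, which mirrors the fixed number of digits punched in each field. This is a sketch under assumptions: `split_number`, `master_card` and `extract_root` are hypothetical helpers written for the illustration, the master card is computed on demand instead of being looked up in a prepunched deck, and products are rounded to eight places.

```python
from decimal import Decimal

EIGHT_PLACES = Decimal("0.00000001")


def split_number(x):
    """Return (N, K), where N keeps the first three significant digits of x."""
    unit = Decimal(1).scaleb(x.adjusted() - 2)   # place value of the third digit
    n = (x // unit) * unit
    return n, x - n


def master_card(n):
    """Hypothetical master card: the factor to eight places, and 2N."""
    factor = (1 / (2 * n.sqrt())).quantize(EIGHT_PLACES)
    return {"factor": factor, "two_n": 2 * n}


def extract_root(x):
    n, k = split_number(x)              # Step 1: match the detail card to its master
    card = master_card(n)               # Step 2: gang-punch the factor and 2N
    multiplicand = card["two_n"] + k    # Step 4: 2N and K read as one number
    # Step 5: one multiplication gives the approximate root.
    return (multiplicand * card["factor"]).quantize(EIGHT_PLACES)


print(split_number(Decimal("0.251568")))
print(extract_root(Decimal("0.251568")))
```

Sorting the master cards back out (Step 3) has no counterpart here, since nothing is physically merged.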


5. Discussion of the Accuracy of the Roots Obtained by This Method 
The scientific worker using any approximation method desires 
to know the accuracy of his results. If they are accurate enough for 
the purpose of his study he need not apply any correction formula, 
but if not, he must be sure that he corrects those figures which do 
not meet his criterion of accuracy. This section discusses the accu- 
racy of the roots obtained by the IBM method and the steps in evalu- 
ating and removing errors so that uniform accuracy is obtained. 


Magnitude of Errors in the Obtained Roots 

Errors in roots obtained by this method are always positive, and
their magnitude is determined by the relationship of K to N, i.e.,
the larger K is, relative to N, the larger the error will be in the
obtained value of √(N + K).

The difference between the exact square root and the obtained
value is

\[ \tfrac{1}{8} N^{-3/2} K^2 - \tfrac{1}{16} N^{-5/2} K^3 + \cdots \tag{4} \]

which is the sum of the third and following terms in the binomial
expansion of Formula (1), and is the sum of an alternating series
in which the terms continuously decrease. The sum of the series is
less than the first term of (4); hence the error in the obtained root is

\[ \text{Error} < \tfrac{1}{8} N^{-3/2} K^2 = \frac{K^2}{8N\sqrt{N}} . \tag{5} \]

Let the ratio of K to N be denoted by F:

\[ F = K/N . \tag{6} \]

Then

\[ K = FN \tag{7} \]

and by substitution of (7) in (5),

\[ \text{Error} < \frac{F^2 N^2}{8N\sqrt{N}} = \frac{F^2 \sqrt{N}}{8} . \tag{8} \]

Exhibit 6 evaluates the maximum positive error in the obtained roots
by Formula (8), for various values of K, and gives the corresponding
number of significant figures in the obtained roots.*

A consideration of Formula (6) will show that F decreases with
increasing values of N. Thus, when K is fixed at its upper limit, the
maximum positive error (8) decreases as N increases.† Exhibit 7
indicates in detail the number of accurate digits in the roots when K
is expressed as a ratio of N, or F = K/N. The line at the top of the
chart is the highest possible ratio which K may have with N, and
the other lines indicate the values of F which determine the number
of significant figures at every value of N.
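The bound can be compared with the actual error for the running example. A minimal sketch in floating point; the function names are chosen here for illustration.

```python
import math


def obtained_root(n, k):
    return (2.0 * n + k) / (2.0 * math.sqrt(n))   # Formula (3)


def error_bound(n, k):
    return k * k / (8.0 * n * math.sqrt(n))       # Formula (5)


def error_bound_from_ratio(n, f):
    return f * f * math.sqrt(n) / 8.0             # Formula (8), with F = K/N


n, k = 0.251, 0.000568
actual = obtained_root(n, k) - math.sqrt(n + k)
bound = error_bound(n, k)
print(f"actual error {actual:.3e}, bound {bound:.3e}")
print(math.isclose(bound, error_bound_from_ratio(n, k / n)))
```

The actual error is positive and falls just under the bound, as expected when only the first term of the alternating series (4) is kept.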


Increasing the Number of Correct Digits in the Obtained Roots 
The error in the obtained roots may be removed by machine
procedures following the first extraction. The error, Formula (5), may
be expressed as

\[ \text{Error} = \frac{K^2}{8N\sqrt{N}} = K^2 \cdot \frac{1}{N} \cdot \frac{1}{4} \cdot \frac{1}{2\sqrt{N}} = K^2 \left( \frac{.25f}{N} \right), \]

\[ \text{Error} = K^2 C , \tag{9} \]

where f is the multiplying factor 1/(2√N), and C is the member in
parentheses (.25f/N) (see Exhibit 2). The value of C may be
punched on every card in the master deck and reproduced into the
detail cards at step 2 in the root extraction.

Thus, by subtracting Formula (9) from the obtained root,
Formula (3), the true root is obtained.


Example: The problem in section 2 shows .50156587 as the root
of .251568.

    K = .000568,
    C = .25f/N = .25(.99800598)/.251 = .994.‡

Substituting in the correction Formula (9),

    K²C = (.000568)²(.994) = .00000032

and the root to eight significant places is

    √.251568 = .50156587 − .00000032 = .50156555.


* It was found that correcting for terms following the first in Formula (4)
did not affect the eighth digit in the root, so for practical purposes the error is
taken to be equivalent to Formulas (5) and (8).

† Example: Let K = .00999; then for N = 1.00, F = .00999, but for
N = 9.99, F is reduced to .00100.

‡ Values of C need not be carried beyond three significant figures, and
possibly only to two.


Computation of the Correction Factor, C

The values of C were not punched on our master decks in time
to be included in the listing of Exhibit 1. However, they may be
obtained easily on the IBM Multiplier by group-multiplying the factors
1/(2√N) by .25, matching the master deck with a set of reciprocals of
N, and group-multiplying the products by the reciprocals.
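The correction factor and its use on the running example can be sketched as follows, again in floating point and with variable names chosen for the illustration.

```python
import math

n, k = 0.251, 0.000568
f = 1.0 / (2.0 * math.sqrt(n))     # multiplying factor 1/(2*sqrt(N))
c = 0.25 * f / n                   # correction factor C = .25f/N
obtained = (2.0 * n + k) * f       # approximate root, Formula (3)
corrected = obtained - k * k * c   # subtract the error term, Formula (9)

print(f"{c:.3f}")
print(f"{corrected:.8f}")
print(f"{math.sqrt(n + k):.8f}")   # library square root, for comparison
```

After the correction the remaining discrepancy is of the order of the next term of series (4), well below the eighth decimal place in this example.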


Steps in IBM Procedure for Increasing the Number of Correct Digits 
in the Obtained Roots 

Step A. Multiply K by itself to obtain K². (On Exhibit 4, K² is
punched in field J.)

Step B. Multiply K? by C (field J by field E, punch into field L). 

Step C. Subtract K?C from the root previously obtained (field 
L from field H, punch in field M). 


This process automatically corrects all roots to at least seven
significant figures. Errors in the eighth significant figure are due to
rounding off of the root and the corresponding digit in the correction
factor K²C. If a random distribution is assumed for the digits in
the eighth place, one-third of the corrected roots will be accurate to
this place, and in no case will the error exceed two units, in either
direction, in the eighth place.


6. Reducing the Volume of Cards to Which the Correction 
Formula is Applied 


From Exhibit 7 it is seen that if N is 541 or more,* all roots 
extracted by this method will be accurate to at least seven figures. 
It is obvious that if seven-place roots are sufficiently accurate for 
the purpose of the specific problem it is unnecessary to apply the 
correction formula to numbers above 541. Similarly, it is unneces- 
sary to correct many other numbers whose F ratios are below the 
values indicated for seven-place accuracy. But for N’s between 100 
and 540, some roots will be accurate to only five or six places, de- 
pending upon the ratio F = K/N. 

Since the ratio F determines the accuracy of the root, it is
possible to select out the cards with F’s falling above the line indicated
on Exhibit 7. The selection, however, must be made on the basis of
N and K, as these values, not F, are to be found in the detail cards.
The rest of this paper is concerned with the determination of critical
values, N + K_crit., which may be utilized in reducing the volume of
work required to obtain uniform accuracy at the level desired.

* Exhibit 7, for simplicity, is based on a deck starting with N = 1.00 and
going up to N = 99.90. The value of N is always a three-digit number, and the
K digits always start at the fourth place from the left.


Computation of Critical Values, N + K_crit., to Obtain Roots Accurate
to at Least Seven Places

To determine the critical values of K for N’s with different
values of F, the following equation for F must first be solved for every
card between N = 100 and N = 540:*


Log F = .25 Log N − 2.04845 . (10)


Find the log of N and enter it on the listing of Exhibit 1.
Multiply log N by .25 and add algebraically the value −2.04845.
Find the antilog of the result to get F, and enter on the listing.
Keypunch the F values into the master cards.

Run the master cards through the multiplier to multiply F
by N and carry only to the first significant digit. Reduce this
by 1 in order to catch cards which may be on the borderline of
significance. The product, FN = K_crit., is the highest K which
a detail card may have and still retain seven-place accuracy in
the root.

Reproduce N and K_crit. into the same field around a decimal
point common to all decks (see Exhibit 3), dropping out the
unwanted F and K_crit. fields. For all cards not having N + K_crit.
values (viz., numbers above 541) X-punch the field to facilitate
a subsequent collating operation.


Example: x = .251568; N = .251, K = .000568;


Should the obtained root in this card be corrected according
to the N + K_crit. value?


Solving (10) for F, 


Log F = log N — 2.04845 

log .251 — 2.04845 
(9.39967 — 10) — 2.04845 
9831 — 10 


"= 002502 . 


* Application of the correction formula to only these cards assumes that
seven-place accuracy in the remaining cards is sufficient for the purpose of the
problem. If eight-place accuracy is desired, all the cards may be corrected, or
only those falling above the line on Exhibit 7 which divides seven- from
eight-place accuracy. The technique outlined above may be followed in getting
K_crit. values from the F values of this line. The formula for this line is
log F = .25 log N − 2.54845. In the same manner the cards with only five-place
accuracy may be corrected by finding the K_crit. values corresponding to the
upper line. The formula is log F = .25 log N − 1.54845.


But FN = K_crit.

K_crit. = (.0025) (.251) = .0006

Reducing K_crit. by one digit,

K_crit. = .0005

and N + K_crit. = .2515.

Therefore, this card would be selected for correction of the 
obtained root. 
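The selection can also be cross-checked directly from the error bound of Formula (5), without going through Formula (10). The sketch below is not the author's procedure: it solves the bound for the largest admissible K, and the tolerance of one unit in the seventh decimal place is an assumption made here for the illustration, as are the function names.

```python
import math


def critical_k(n, tolerance):
    """Largest K for which the bound K^2 / (8 N sqrt(N)) stays within tolerance."""
    return math.sqrt(8.0 * tolerance * n * math.sqrt(n))


def needs_correction(n, k, tolerance):
    return k > critical_k(n, tolerance)


tolerance = 1e-7   # assumed: one unit in the seventh decimal place
print(needs_correction(0.251, 0.000568, tolerance))
```

Under this assumed tolerance the card for x = .251568 is again marked for correction, in agreement with the conclusion of the example, although the critical value itself need not coincide with the one obtained from Formula (10).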


Steps in Utilizing N + K_crit. Values in Applying the Correction
Formula

To utilize the value of N + K_crit. in reducing the work of
applying the correction formula, reproduce it into the detail cards at
Step 2 of the extraction procedure.

Steps 1 to 5. Follow the regular extraction procedure for the
approximate roots.

Step 6. Compare field N + K and N + K_crit. (fields B and F,
Exhibit 4) on an IBM Collator. Select as follows:

Pocket 1, those cards with N + K equal to or greater than
N + K_crit.

Pocket 2, those cards with N + K less than N + K_crit. and all
cards with X-punches in the N + K_crit. field. Remove cards from
pocket and hold, as they are complete and accurate to at least seven
places.

Step 7. Apply the correction formula, (9), as outlined in steps 
A, B and C (given previously) to the cards in Pocket 1. 

Step 8. Reproduce the cards with the corrected roots from step
C into new cards so that they line up in the same field as the roots
in the completed cards from Pocket 2 (i.e., reproduce field M of
the corrected cards into the columns of field H of the uncorrected
cards). Drop out, of course, the K² and K²C values (fields J and L)
obtained as a result of steps A and B.


Estimation of Work Saved by Utilizing the N + K_crit. Values

If the detail cards are evenly distributed over the range of the
1800 cards in the master deck, only about 50% will have N + K_crit.
values punched in them. Of these cards, assuming a random distribution
of K’s, only about 50% will be selected for correction. The number
of cards to be corrected, therefore, amounts to only about 25%
of the total number of cards.*


* The criterion is still seven-place accuracy. If eight-place accuracy is
desired, more cards would have to be corrected; if six, fewer cards.


7. Summary of Complete Procedure


Obtaining the Roots 

Step 1. Merge Master Deck with Detail Cards, Masters in front 
of Details, selecting unmatched Masters. 

Step 2. Reproduce 1/(2√N), 2N, C and N + K_crit. (fields C, D,
E, F) from the Master Deck into detail cards (fields C, D, E, F).

Step 3. Separate Masters and Details, return Masters to file.

Step 4. Wire the Multiplier plugboard as in Exhibit 5.

Step 5. Multiply 1/(2√N) by 2N + K and punch root into card
(field H).

Correcting Roots of Accuracy Less Than Desired 

Step 6. Match x (N + K, field B) with N + K_crit. (field F) on
a collator, selecting in Pocket 1 when field B is greater than field F.
Select in pocket 2 the cards in which field B is less than field F, and
the cards X-punched in field F, and hold.

Step 7. On selected cards from pocket 1, multiply the K value
of field B by itself and punch the product, K², in field J.

Step 8. Multiply the correction factor C (field E), by K² (field
J) and punch product, K²C, in field L.

Step 9. Subtract K²C (field L) from √x (field H), and punch
the corrected root into field M.

Step 10. Reproduce the card identity (field A) and x (field B) 
into corresponding fields on new cards, and punch field M (the cor- 
rected root) into field H. 

Step 11. Combine the above deck with the cards held in step 6. 
All cards now have the roots punched into the same card field and 
all are accurate to at least seven places. (If further operations are 
going to be performed on the roots, such as the computation of corre- 
lation coefficients, the cards held in step 6 should also be reproduced, 
dropping out the unwanted fields.) 


Proof 
Step 12. Multiply the roots now in field H by running through
the multiplier with field H as both multiplier and multiplicand. Punch
product (√x)² in field N.

Step 13. Check by any of the following methods: 

(a) Compare field N with field B on a collator, pulling un- 
matched cards. The number of columns compared depends upon the 
desired accuracy. Unmatched cards are visually checked for inac- 
curacy due to rounding errors. 

(b) Total fields N and B on a tabulator. If no errors are pres- 
ent, the totals should be in reasonable agreement. 

(c) List fields N and B and compare visually. 








EXHIBIT 1 
Sample Listing of Square Root Deck 2 

              Multiplying Factor
     N        $1/(2\sqrt{N})$        $\sqrt{N}$

    .251         .99800598           .5009990
    .252         .99602334           .5019960
    .253         .99405347           .5029911
    .254         .99209474           .5039841
    .255         .99014754           .5049752
    .256         .98821177           .5059644
    .257         .98628730           .5069517
    .258         .98437404           .5079370
    .259         .98247186           .5089204
    .260         .98058068           .5099020
    .261         .97870037           .5108816
    .262         .97683083           .5118594

   5.91          .20567252          2.431049
   5.92          .20549873          2.433105
   5.93          .20532539          2.435159
   5.94          .20515248          2.437212*
   5.95          .20498002          2.439262
   5.96          .20480798          2.441311
   5.97          .20463638          2.443358
   5.98          .20446521          2.445404
   5.99          .20429446          2.447448
   6.00          .20412415          2.449490
   6.01          .20395425          2.451530
   6.02          .20378479          2.453569

  23.9           .10227537          4.888763
  24.0           .10206207          4.898979
  24.1           .10185011          4.909175
  24.2           .10163945          4.919350*
  24.3           .10143010          4.929503
  24.4           .10122204          4.939636
  24.5           .10101525          4.949747
  24.6           .10080973          4.959839
  24.7           .10060545          4.969909
  24.8           .10040242          4.979960
  24.9           .10020060          4.989990
  25.0           .10000000          5.000000
EXHIBIT 3 

EXHIBIT 5 


TEST SELECTION AND SUPPRESSOR VARIABLES 


ROBERT J. WHERRY 
UNIVERSITY OF NORTH CAROLINA 


A theoretical discussion of the factor pattern of predictor tests 
and criterion shows that ordinary test selection methods break down 
under certain circumstances. It is shown that maximal results may 
not occur if suppressor variables are present among the predictors. 
Suggested solutions to the problem include: (1) prior item analysis 
of tests against the criterion, (2) selection of several trial 
batteries, including some with suppressor variables, on the basis of 
a factor analysis of tests and criterion, (3) modification of the 
usual test selection procedures to include separate solutions based 
upon each of several starting variables, or (4) the cumbersome and 
tedious solution of all possible combinations of predictors. The 
solutions are recommended in the order named above. Although all of 
the suggested solutions involve added labor and may not be necessary, 
the test or battery constructor should at least be aware of the 
problem. 


The Wherry Test Selection Method (3) adds tests, one at a time, 
to a previously selected battery, starting with the test having the 
highest criterion correlation. The method does maximize the composite 
correlation at each step, given that the tests already selected are 
to be retained. When only a few tests are to be selected, however, 
the multiple correlation for the same number of tests might have been 
even higher if another, less valid, test or group of tests had been 
employed at some earlier step in the process. This difference between 
the practical battery selected cumulatively by test selection methods 
and the theoretically possible higher battery which might exist was 
emphasized in a paper on Toops’ L-Method (4) and also in connection 
with the Wherry-Gaylord Integral Weight Method (6), both of which are 
subject to this same type of error. 


The above situation is most apt to obtain when the predictor 
variables include what Horst called suppressors. A suppressor 
variable is one which has no or low correlation with the criterion 
but correlates moderately or highly with a variable which does show 
significant correlation with the criterion. Meehl (2) and McNemar (1) 
have recently written concerning the suppressor variable, and the 
latter paper stressed the role of overlapping elements, or factor 
pattern, as an explanation of how the suppressor variable works. The 
present paper utilizes this factorial approach to demonstrate that 
such variables may cause ordinary test selection methods to break 
down, i.e., to yield smaller composite correlations than would other 
methods used to select batteries of equal length. 

In order to show how the test selection methods are affected by 
factor pattern it is necessary to review a few symbols and concepts 
from factor analysis. The proportion of standard score variance 
explainable by the common factors is called the communality 
($h_i^2$) of a test. This communality is obtained by adding together 
the squares of the loadings on the various factors, thus 


$h_i^2 = l_{iA}^2 + l_{iB}^2 + l_{iC}^2 + \cdots + l_{iM}^2$,   (1) 


where $i$ refers to the variable in question and $A, B, C, \ldots, M$ 
refer to the common factors. The proportion of standard score 
variance not explainable by such common factors is called uniqueness 
($u_i^2$), which is composed of two parts, specific ($s_i^2$) and 
error ($e_i^2$). Specific refers to any factor or factors present in 
a single variable among a group of variables under consideration. The 
sum of the communality and of the specific of a given test 
constitutes its reliability, thus 


$h_i^2 + s_i^2 = r_{ii}$.   (2) 


The correlation between any two variables is given by the sum of 
the products of their common factor loadings, thus 


$r_{ij} = l_{iA} l_{jA} + l_{iB} l_{jB} + \cdots + l_{iM} l_{jM}$.   (3) 


Consider now a criterion composed of three factors A, B, and C 
in the proportions of .50, .30, and .10, respectively, with the other 
.10 due to error, i.e., 


              Proportion of     Factor
                Variance        Loading
Factor A          .50             .71
Factor B          .30             .55
Factor C          .10             .32
Uniqueness        .10             .32
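
The relations in Equations (1) and (3) can be illustrated with a short Python sketch. The function names and the single predictor used here are hypothetical, introduced only for illustration. A loading is taken as the square root of the proportion of variance, and the uniqueness is left out of the dictionaries because it is not a common factor.

```python
from math import sqrt


def loadings(proportions):
    """Factor loadings are the square roots of the variance proportions."""
    return {factor: sqrt(p) for factor, p in proportions.items()}


def communality(load):
    """Equation (1): sum of squared common-factor loadings."""
    return sum(value ** 2 for value in load.values())


def correlation(load_i, load_j):
    """Equation (3): sum of products of loadings on shared factors."""
    return sum(load_i[f] * load_j[f] for f in load_i if f in load_j)


criterion = loadings({"A": 0.50, "B": 0.30, "C": 0.10})
predictor = loadings({"A": 0.98})  # hypothetical test measuring only factor A

print({f: round(v, 2) for f, v in criterion.items()})
print(round(communality(criterion), 2))
print(round(correlation(predictor, criterion), 2))
```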


This paper will show how the Wherry Test Selection Method will 
work in selecting tests to predict this criterion in several cases. 
Case I 

Suppose all predictors are pure and perfect measures of one of 
the three factors, say 


          Proportion of Factor        Loading on Factor
Test        A      B      C             A      B      C
  1       1.00     --     --          1.00     --     --
  2       1.00     --     --          1.00     --     --
  3       1.00     --     --          1.00     --     --
  4        --    1.00     --           --    1.00     --
  5        --    1.00     --           --    1.00     --
  6        --      --   1.00           --      --   1.00
  7        --      --   1.00           --      --   1.00

From Equation (3) above, the intercorrelation table will be 


          Criterion      1      2      3      4      5      6      7
Criterion      1.00    .71    .71    .71    .55    .55    .32    .32
1                     1.00   1.00   1.00    .00    .00    .00    .00
2                            1.00   1.00    .00    .00    .00    .00
3                                   1.00    .00    .00    .00    .00
4                                          1.00   1.00    .00    .00
5                                                 1.00    .00    .00
6                                                        1.00   1.00
7                                                               1.00


The best battery will consist of three tests taken one from each 
group. The test selection method will work perfectly, selecting in 
order test 1 (or 2 or 3) and 4 (or 5) and then 6 (or 7). The 
Doolittle solution is 


     (1)     (4)     (6)      C
    1.00     .00     .00    -.71   A_1
   -1.00     .00     .00    +.71   R_1
            1.00     .00    -.55
             .00     .00     .00
            1.00     .00    -.55   A_2
           -1.00     .00    +.55   R_2
                    1.00    -.32
                     .00     .00
                     .00     .00
                    1.00    -.32   A_3
                   -1.00    +.32   R_3


Now by use of the well-known equation 

$r_{c \cdot 12\ldots} = \sqrt{\sum (-A_i) R_i}$,   (4) 

we have 

$r_{c \cdot 146} = \sqrt{(.71)^2 + (.55)^2 + (.32)^2} = \sqrt{.909}$, 

where the .009 is error added by rounding of decimals, the true figure 
being .900. 


Case II 


Our tests will seldom consist, however, of perfect measures of 
the factors involved in the criterion. We must assume, therefore, 
some uniqueness in each predictor, say, 


             Proportion of Factor            Loading on Factor
Test       A      B      C      U          A      B      C      U

  1       .98    .00    .00    .02        .99    .00    .00    .14
  2       .95    .00    .00    .05        .97    .00    .00    .22
  3       .86    .00    .00    .14        .93    .00    .00    .37
  4       .00    .90    .00    .10        .00    .95    .00    .32
  5       .00    .62    .00    .38        .00    .79    .00    .62
  6       .00    .00    .70    .30        .00    .00    .84    .55
  7       .00    .00    .52    .48        .00    .00    .72    .69


where it must be remembered that the uniqueness loadings (U) are 
actually unrelated to each other. Thus disregarding the U loadings 
and using Equation (3), we have the following intercorrelation table: 


          Criterion      1      2      3      4      5      6      7
Criterion      1.00    .70    .69    .66    .52    .43    .27    .23
1                     1.00    .96    .92    .00    .00    .00    .00
2                            1.00    .90    .00    .00    .00    .00
3                                   1.00    .00    .00    .00    .00
4                                          1.00    .75    .00    .00
5                                                 1.00    .00    .00
6                                                        1.00    .60
7                                                               1.00

The test selection method will again work perfectly, selecting (if we 
stop with three tests) first 1, then 4, and finally 6. The Doolittle 
solution is: 


     (1)     (4)     (6)      C
    1.00     .00     .00    -.70   A_1
   -1.00     .00     .00    +.70   R_1
            1.00     .00    -.52
             .00     .00     .00
            1.00     .00    -.52   A_2
           -1.00     .00    +.52   R_2
                    1.00    -.27
                     .00     .00
                     .00     .00
                    1.00    -.27   A_3
                   -1.00    +.27   R_3


Whence by Equation (4), the multiple is 

$r_{c \cdot 146} = \sqrt{(.70)^2 + (.52)^2 + (.27)^2} = \sqrt{.833}$, 

where there is an error of .003 due to rounding of decimals, the 
correct result being $\sqrt{.830}$. This is the best battery possible 
using only three tests. 
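
The size of the rounding error in Case II can be checked with a short Python sketch. It assumes, as the tables suggest, that each loading is rounded to two decimals and that each criterion correlation is the rounded product of two rounded loadings. The variable names are illustrative only. Because Tests 1, 4, and 6 are mutually uncorrelated, the squared multiple is simply the sum of the squared criterion correlations.

```python
from math import sqrt

criterion = {"A": 0.50, "B": 0.30, "C": 0.10}
# Test number -> (factor measured, proportion of variance on that factor)
battery = {1: ("A", 0.98), 4: ("B", 0.90), 6: ("C", 0.70)}

# Exact squared multiple: sum of products of the variance proportions.
exact = sum(criterion[factor] * p for factor, p in battery.values())

# Two-decimal working, as in the printed tables.
rounded = 0.0
for factor, p in battery.values():
    validity = round(round(sqrt(p), 2) * round(sqrt(criterion[factor]), 2), 2)
    rounded += validity ** 2

print(round(exact, 4), round(rounded, 4))
```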





Case III 


In Case II above what we called uniqueness included both error 
and specific factors. Insofar as total uniqueness remains small for 
at least one test in each group, there is no particular problem. Insofar 
as uniqueness is due entirely to error there is also no problem. If, 
however, the uniqueness is large and composed largely of specific, 
the problem of suppressor variables arises. 

Consider the following test battery in which Test 8 meets the 
definition of suppressor variable in that it has zero correlation with 
the criterion (has none of factors A, B, or C) but does correlate 
with Test 3 (has quite a bit of factor X) which does correlate with 
the criterion (has factor A as well as X): 


               Proportion of Factor                Factor Loading on
Test      A     B     C     X     U          A     B     C     X     U

  1      .60   .00   .00   .00   .40        .77   .00   .00   .00   .63
  2      .55   .00   .00   .00   .45        .74   .00   .00   .00   .67
  3      .50   .00   .00   .49   .01        .71   .00   .00   .70   .10
  4      .00   .98   .00   .00   .02        .00   .99   .00   .00   .14
  5      .00   .90   .00   .00   .10        .00   .95   .00   .00   .32
  6      .00   .00   .90   .00   .10        .00   .00   .95   .00   .32
  7      .00   .00   .80   .00   .20        .00   .00   .89   .00   .45
  8      .00   .00   .00   .98   .02        .00   .00   .00   .99   .14
The intercorrelation table would be 


          Criterion      1      2      3      4      5      6      7      8
Criterion      1.00    .55    .53    .50    .54    .52    .30    .28    .00
1                     1.00    .65    .60    .00    .00    .00    .00    .00
2                            1.00    .55    .00    .00    .00    .00    .00
3                                   1.00    .00    .00    .00    .00    .69
4                                          1.00    .93    .00    .00    .00
5                                                 1.00    .00    .00    .00
6                                                        1.00    .85    .00
7                                                               1.00    .00
8                                                                      1.00

The Wherry Test Selection Method will pick (if we stop with three 
tests) variables 1, 4, and 6. The Doolittle solution would be 


     (1)     (4)     (6)      C
    1.00     .00     .00    -.55   A_1
   -1.00     .00     .00    +.55   R_1
            1.00     .00    -.54
             .00     .00     .00
            1.00     .00    -.54   A_2
           -1.00     .00    +.54   R_2
                    1.00    -.30
                     .00     .00
                     .00     .00
                    1.00    -.30   A_3
                   -1.00    +.30   R_3


Whence by Equation (4), we have 

$r_{c \cdot 146} = \sqrt{(.55)^2 + (.54)^2 + (.30)^2} = \sqrt{.684}$, 


which is the highest value which can be arrived at with three tests 
if one uses the cumulative approach of the test selection methods. 
If on the other hand one had selected variables 3, 4, and 8, the Doo- 
little solution would have been: 





     (3)     (4)     (8)      C
    1.00     .00     .69    -.50   A_1
   -1.00     .00    -.69    +.50   R_1
            1.00     .00    -.54
             .00     .00     .00
            1.00     .00    -.54   A_2
           -1.00     .00    +.54   R_2
                    1.00     .00
                    -.48    +.35
                     .00     .00
                     .52    +.35   A_3
                   -1.00    -.67   R_3


Whence by Equation (4), we have 

$r_{c \cdot 348} = \sqrt{(.50)^2 + (.54)^2 + (.35)(.67)} = \sqrt{.776}$. 


Thus by taking the two Tests 3 and 4 as our battery of two tests (a 
less valid battery than Tests 1 and 4 picked by the test selection 
method), it is possible to make use of Test 8 (the suppressor vari- 
able for Test 3) to secure a more valid team of three tests than is 
possible by the cumulative approach of the test selection methods. 
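
The contrast between the cumulative approach and the best three-test battery in Case III can be reproduced with a short Python sketch. It is not the Doolittle routine itself: correlations are computed from the unrounded factor proportions by Equation (3), the squared multiple is obtained by solving the normal equations directly, and all function names are illustrative assumptions. Small differences from the printed two-decimal figures are therefore to be expected.

```python
from itertools import combinations
from math import sqrt

# Proportions of variance on the common factors (Case III). Uniqueness is
# omitted because unique parts correlate with nothing else.
CRITERION = {"A": 0.50, "B": 0.30, "C": 0.10}
TESTS = {
    1: {"A": 0.60}, 2: {"A": 0.55}, 3: {"A": 0.50, "X": 0.49},
    4: {"B": 0.98}, 5: {"B": 0.90}, 6: {"C": 0.90},
    7: {"C": 0.80}, 8: {"X": 0.98},
}


def corr(p, q):
    """Equation (3): sum of products of loadings on shared factors."""
    return sum(sqrt(p[f] * q[f]) for f in p if f in q)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(subset):
    """Squared multiple correlation of the criterion with the given tests."""
    subset = list(subset)
    validities = [corr(TESTS[t], CRITERION) for t in subset]
    inter = [[1.0 if i == j else corr(TESTS[i], TESTS[j]) for j in subset]
             for i in subset]
    betas = solve(inter, validities)
    return sum(b * v for b, v in zip(betas, validities))


def cumulative_selection(k):
    """Add one test at a time, keeping all tests already chosen."""
    chosen = []
    while len(chosen) < k:
        rest = [t for t in TESTS if t not in chosen]
        chosen.append(max(rest, key=lambda t: r_squared(chosen + [t])))
    return chosen


def best_subset(k):
    """Try every combination of k tests and keep the most valid one."""
    return max(combinations(TESTS, k), key=r_squared)


greedy = cumulative_selection(3)
best = best_subset(3)
print(greedy, round(r_squared(greedy), 3))
print(best, round(r_squared(best), 3))
```

With eight predictors the exhaustive search covers only 56 three-test batteries, so it is cheap here. The number of combinations grows rapidly as predictors are added.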

The whole difficulty lies in the fact that Tests 3 and 8 must be 
selected simultaneously if they are to be selected at all in the present 
example. This is not to say that the test selection methods will never 
select suppressor variables. If Test 3 had retained its present valid- 
ity, but Tests 1 and 2 had been lowered by an interchange of loadings 
on factors A and U, Test 3 would have been picked directly after 
Test 4 by the test selection methods. Once Test 3 had been selected 
as the second test in the battery under its own power, Test 8 would 
have been selected as the third test in the battery also. The test se- 
lection method would then have selected the best theoretical battery 
even though suppressors were present. However, the presence of 
suppressor variables may cause the test selection methods to break 
down. The test builder wants to be certain that he has selected the 
best battery or at least to have a high probability of having accom- 
plished that end. 





Solutions to the Problem 


1. One possibility is to eliminate the need for suppressors by 
prior item analysis of the original predictors against the criterion. 
In Case III above, where the standard score variance of Test 3 was 
composed as follows, 


$V_3 = .50A + .49X + .01e$, 


prior item analysis would eliminate the items contributing to the 
.49X, since they would correlate zero with the criterion, leaving 


$V_3' = (.50A + .01e)/(.51) = .98A + .02e$. 


The use of this purified Test 3, containing only items bearing direct 
relationship to the criterion, would have assured the immediate se- 
lection of Test 3 by any method and would have eliminated the need 
for suppressor Test 8. It should be noted that prior item analysis 
against total test score would have been futile in this instance, since 
Factor X is represented almost as strongly as factor A in the test. 
Wherry and Gaylord (5) have stressed the dangers of item analysis 
against total test score. This solution of the problem is in line with 
the frequent warning that all tests should be standardized on the 
type population and the particular job for which they are to be used. 


2. Another solution would be to factor-analyze the tests includ- 
ing the criterion, secure the rotated factor pattern, and select several 
trial batteries of the desired number of tests for Doolittle solution. 
Care would be taken to include batteries containing possible sup- 
pressors among those selected for trial. This solution is not so good 
as the first above but is felt to be better than the two which follow. 

3. Another possibility would be to start a separate test selec- 
tion process with each of the upper half or upper third of the cri- 
terion correlations rather than starting only with the largest cri- 
terion correlation. 

4. The most laborious solution is to compute all possible combi- 
nations of two, three, or more tests up to the limit of testing time 
available. This approach would assure the best solution, but is con- 
sidered excessively time-consuming if the number of possible predic- 
tors is at all large. 

It should finally be noted that these precautions appear to be 
necessary only if suppressor variables are present. If a casual in- 
spection of the intercorrelation table (an inspection of the rotated 
factor loadings would be better) shows no potential suppressors, the 
battery constructor can probably proceed quite safely with the regu- 
lar selection methods. 
A comparison of factorial analyses made by Thurstone and by 
the writer shows that they differ in three important respects. The 
results of the two analyses are compared with respect to their social 
utility, which is offered as a proper criterion for judging the merit 
of factorial analyses performed by different mathematical procedures. 


In 1944, the writer published the results of an exploratory study 
of comprehension in reading in the hope that it would stimulate 
further research in this important field.* He was, therefore, glad to 
learn that Dr. L. L. Thurstone had made use of the basic data for a 
different type of analysis.† A comparison of the writer’s original 
analysis with that of Thurstone is sufficiently interesting to 
justify a brief commentary. 

It is well known that an infinite number of different but 
mathematically satisfactory solutions may be obtained by factorial 
analysis to explain the intercorrelations in a given matrix and to 
account for the variance of the original variables. A great deal of 
thought and ingenuity has been expended by investigators to devise 
the means for obtaining a factorial solution that is not only 
mathematically correct but also meaningful or useful in a practical 
way. As might be expected, there are marked differences of opinion 
regarding the relative meaningfulness or practical utility of the 
solutions yielded by factorial methods devised by various 
investigators. Each investigator has ordinarily, and not unnaturally, 
displayed a strong preference for the results obtained by using the 
method he has devised. 

* Davis, F. B., Fundamental factors of comprehension in reading. 
Psychometrika, 1944, 9, 185-197. 
† Thurstone, L. L., Note on a reanalysis of Davis’ reading tests. 
Psychometrika, 1946, 11, 185-188. 

Differences between the results of Thurstone’s analysis and the 
writer’s stem largely from three differences in the procedures 
employed. First, Thurstone analyzed only the common variance of the 
nine original variables, whereas the writer analyzed their total 
variance. Second, Thurstone weighted the variables for purposes of 
analysis in terms of their communalities, whereas the writer weighted 
them roughly in proportion to their judged importance in reading. 
Third, Thurstone used a centroid method of analysis, whereas the 
writer used a principal-axis method developed by Kelley.* 

With regard to the first difference, the writer wished to include 
in his analysis the nonchance specific variance of the initial variables 
as well as their common variance. When the writer constructed the 
items for the nine initial variables, it was obvious that they would all 
measure what is known as “reading comprehension,” but he tried to 
have each one of the nine different types of items measure a somewhat 
different skill involved in comprehension. A great deal of time and ef- 
fort was devoted to make each of the nine types of items measure, in- 
sofar as possible, a reading skill that would appear, subjectively, to 
authorities in the field to be distinguishably different from the others; 
in other words, to make the variance of items of each of the nine 
types measure at least a small proportion of unique variance. 

While retaining the specific variance in the analysis, the writer 
could have analyzed either the nonchance variance alone or the total 
variance. The latter was analyzed to permit convenient use of a test 
of the significance of the difference between successive component 
variances.† To avoid being misled by the chance variance included in 
the total variance analyzed, the reliability coefficient of each 
component resulting from the analysis was computed and compared with 
the standard error estimated for a reliability coefficient of zero. 

These reliability coefficients were computed empirically and were 
checked by independent recomputation. Five of them were found to be 
significantly different from zero at the 8% level, or better. This 
result led the writer to conclude that the variances of these five 
designated components were significantly greater than zero and that 
useful measures of all five of them could be obtained by constructing 
the required numbers of additional items of the appropriate types.‡ 
The writer sees no reason to alter his original conclusion. To have 
ignored any one of these five components in the interpretation of the 
analysis would seem to have been indefensible when individual 
component scores for 100 cases were actually computed and proved in 
the case of the five designated components to have reliability 
coefficients significantly different from zero. In this connection it 
should be noted that Thurstone’s statement (in the fifth paragraph of 
his note) that “it is doubtful whether anyone would be tempted to 
extract a second factor from residuals so small as these -- to say 
nothing of pulling out nine factors here,” applies only to his own 
analysis. This statement must not be taken to apply to the writer’s 
analysis. 

* Kelley, T. L., Essential traits of mental life. Cambridge: Harvard 
University Press, 1935. 
† Kelley, T. L., A variance-ratio test of the uniqueness of 
principal-axis components as they exist at any stage of the Kelley 
iterative process for their determination. Psychometrika, 1944, 9, 
199-200. 
‡ Hoel’s test for minimum rank in factorial analysis was not 
applicable to the available data since the diagonal entries of the 
original matrix were not unity. See Hoel, P. G., A significance test 
for minimum rank in factor analysis. Psychometrika, 1939, 4, 245-253. 

Some data regarding the stability of the first two components of 
reading comprehension became available in the summer of 1944 and 
are now in press.* As part of a study of judgment and reasoning, the 
writer obtained the intercorrelations of fourteen types of test items 
in a sample of 689 high-school pupils. Type 1 consisted of 
recognition-vocabulary items similar to those which provided the 
largest positive loading in Component I and the largest negative 
loading in Component II of the writer’s original matrix referred to 
by Thurstone. Type 10 consisted of inferences-in-reading items 
similar to those which provided the largest positive loading in 
Component II of the writer’s original matrix. The other twelve types 
of items were entirely different from any of those included in the 
writer’s original matrix. Despite this fact, two unrotated components 
were found of which the variances are significantly different from 
zero and the loadings bear (with respect to the recognition-vocabulary 
and inferences-in-reading items) a striking similarity to the first 
two components of the writer’s original matrix. In Table 1 are shown 
the regression coefficients that determine the two sets of 
components. The four loadings to which reference is made are in 
bold-face type. The reliability coefficient of scores in each 
component in each of the two studies is also presented. A full 
discussion of the two components of judgment and reasoning, and 
identification of all the initial tests will not be presented in this 
brief comment.† 

The appearance of what the writer has interpreted to be very 
similar components of reading comprehension in two widely different 
matrices based on scores from students at two different grade levels 
is encouraging evidence of the stability of principal-axis components 
of which the variances are significantly different from zero and from 
each other. To the writer, these data provide strong empirical support 
for his conclusion (based on sampling theory) that the existence of 
more than one component of reading comprehension was established 
by his original study. 

* Davis, F. B., The AAF Qualifying Examination. AAF aviation 
psychology program research reports, no. 6. Washington: Government 
Printing Office (in press). Chapters 5 and 7. 
† For a full discussion of the 1944 analysis of judgment and 
reasoning, see Davis, F. B., op. cit. Chapter 7. 
TABLE 1 


Regression Coefficients That Yield Scores in the Two Largest 
Independent Components of Reading Comprehension 


Skill                                   I         II
1                                      .81       -.57
2                                      .18        .12
3                                      .06        .05
4                                      .08        .05
5                                      .11        .15
6                                      .34        .47
7                                      .34        .58
8                                      .08        .10
9                                      .23        .25
Variance                            192.27      22.82
Reliability Coefficient                .94*       .48*
Degrees of Freedom                     406        399


Regression Coefficients That Yield Scores in Two 
Components of Judgment and Reasoning 


Test                                   III         X
1 (similar to skill 1 above)           .64       -.45
2                                      .07        .09
3                                     -.12       -.13
4                                      .08        .10
5                                     -.31       -.09
6                                     -.04        .01
7                                      .08       -.37
8                                      .06        .08
9                                      .08        .10
10 (similar to skill 7 above)          .33        .72
11                                     .16        .05
12                                    -.42        .01
13                                    -.27       -.20
14                                    -.26        .18
Variance                              1.02        .71
Reliability Coefficient                .34*       .33*
Degrees of Freedom                     656        653


* The standard error of a reliability coefficient of zero is such that 
this coefficient is significantly different from zero at better than 
the 1% level. 


The second difference between Thurstone’s analysis and the wri- 
ter’s results from the fact that the initial variables were weighted 
differently for purposes of analysis. The number of skills in reading 
that has been deemed important by investigators runs into the hun- 
dreds. For practical reasons, therefore, any factorial analysis of the 
skills in reading comprehension must be based on a selection of these 
skills and ought to be based on a judicious selection of the most 
important skills. Instead of making an arbitrary decision regarding the 
selection of the skills to be analyzed, the writer surveyed the litera- 
ture to identify the skills deemed most important by authorities in the 
field. The resulting list of skills was studied intensively to group to- 
gether those that seemed to require the exercise of the same, or closely 
related, mental skills. The objective was to obtain several groups of 
important skills, each one of which would constitute a cluster having 
relatively high intercorrelations and relatively low correlations with 
other clusters of skills. Any factorial analysis must, in practice, be 
based on a selection of variables and the systematic procedure de- 
scribed above seems to the writer to be highly defensible. 

It was evident to the writer that the nine skills in reading com- 
prehension selected for analysis were not considered of equal impor- 
tance in reading by authorities in the field. Therefore, an effort was 
made, within the limitations of practical circumstances, to weight the 
initial variables for purposes of factorial analysis roughly in propor- 
tion to their importance in the process of reading, as judged by au- 
thorities. It should be explicitly clear that the nine variables were 
weighted by Thurstone as well as by the writer. The difference in pro- 
cedure is not that Thurstone did not weight the variables and that the 
writer did weight them; the difference is that Thurstone weighted them 
in terms of their communalities and the writer weighted them ap- 
proximately in proportion to their importance in reading as judged by 
authorities. The writer’s procedure, which was originally suggested 
by Kelley,* results in an analysis of the elements in an approximation 
to a quantitative representation of the trait called reading comprehen- 
sion, while Thurstone’s procedure results in an analysis of the common 
variance of certain skills involved in the process of comprehension in 
reading. The writer believes that the procedure he followed is logical 
and leads to the analysis of a meaningful matrix. This should lead to 
obtaining meaningful unrotated components. 

McNemar recently pointed out, “The meaning of factors is never 
too clear ..... The general factor is frequently the most difficult to 
define .....”† When each variable has been properly weighted, however, 
so that the matrix to be analyzed is meaningful, the first (or general) 
component is more likely to be interpretable because it more nearly 
represents the first component of a trait defined in socially useful 
terms than the first component of a hodgepodge of test scores. There 
may be occasions when the most meaningful matrix would be obtained 
by weighting the variables to be analyzed in terms of their 
communalities; when this happened to be the case, that procedure 
would be optimal because the communalities would constitute the best 
weights. 

* Op. cit., 98. 
† McNemar, Q. Opinion-attitude methodology. Psychol. Bull., 1946, 43, 355. 

The third difference between Thurstone’s analysis and the writer’s 
is probably less important than others, although there are fundamen- 
tal differences between the centroid and principal-axis methods. The 
principal-axis method used by the writer yielded uncorrelated com- 
ponents. These were then rotated to determine whether they could be 
made more readily interpretable and more useful by removing the con- 
dition that they be orthogonal. Although some trends apparent in the 
unrotated components were emphasized, the results did not seem to 
justify sacrificing the economy of thinking made possible by the un- 
correlated components, so only the unrotated components were pub- 
lished. 

Thurstone’s principal conclusions on the basis of his analysis may 
be summarized in the following quotation: “The given correlations are 
accounted for by a single common factor with remarkably small re- 
siduals ..... The nature of the tests indicates that the one common fac- 
tor is reading ability .....” 

The writer’s conclusions on the basis of his analysis may be sum- 
marized as follows: Comprehension in reading involves at least five 
independent mental abilities, which appear to be: 


1. Word knowledge, 

2. Ability to reason in reading, 

3. Ability to follow the organization of a passage and to identify 
antecedents and references in it, 

4. Ability to recognize the literary devices used in a passage and 
to determine its tone and mood, 

5. Tendency to focus attention on a writer’s explicit statements 
to the exclusion of their implications. 


In view of the fact that in one study reliability coefficients sig- 
nificantly greater than zero were obtained for measures of all five of 
these mental abilities and that in another study reliability coefficients 
significantly greater than zero were obtained for the two of them that 
were measured, the writer cannot avoid regarding all of them as fun- 
damental parameters of comprehension in reading. As such, they 
ought to provide individuals actually engaged in teaching children to 
read and in constructing tests of comprehension in reading with im- 
proved insight into the nature of reading comprehension and with 
clues for improving the teaching of reading and the measurement of 
reading. 

If we take the conclusions drawn from the results of the two dif- 
ferent analyses as equally proper, given the data supplied by the two 
methods of analysis, it seems appropriate to compare the social utility 
of the two sets of conclusions. If the criterion of social utility is con- 
sidered to be the aid which the conclusions provide for teachers of 
reading and constructors of reading tests, it seems to the writer that 
to tell teachers and test constructors that comprehension in reading is 
pretty largely reading ability is neither informative nor helpful. But 
to tell them that at least five essentially unrelated mental abilities are 
involved in reading comprehension and to identify these five abilities 
so that learning exercises can be constructed to improve the pupils’ 
proficiency in them and test questions can be devised to measure pro- 
ficiency in them seems both informative and useful. 
BOOK REVIEWS 


J. McV. HUNT (Editor). Personality and the behavior disorders. A handbook 
based on experimental and clinical research. New York: Ronald Press, 1944. 
Pp. xii + 1242, 2 volumes. 


This handbook was designed to foster the cross-disciplinary approach and to 
report present knowledge, theory, and practice in compact form. The work of for- 
ty persons, it includes the viewpoints of anthropologists, neurologists, physiolo- 
gists, psychologists, and sociologists. Clearly evident is the fact that all have im- 
portant contributions to make in dealing with this complex topic. But apparent 
also is the fact that workers in this field have serious divergencies of inter- 
pretation. From chapter to chapter one finds “mask” definitions and “substance” 
definitions of personality, explanations in terms of psychic factors and in terms 
of physical factors, psychoanalytic interpretations and behavioristic interpre- 
tations. In many instances the cross-disciplinary references must be made by the 
reader. The writers themselves fail to relate their specialized approaches to the 
general field theory that is suggested in the beginning as that best suited to the 
conceptual representation of personality. At the present stage of personality 
study, however, perhaps no greater consistency of approach could be expected. 

The treatment of pathological aspects of personality — rich in content if 
somewhat overweighted with psychoanalytic emphases — is a significant contri- 
bution. As a handbook on general personality development, the book is less satis- 
factory. Material on the normal personality occupies much less space than the 
reader might wish; but considering the prodigious amount of recent work in ab- 
normal and clinical psychology to be covered, this underemphasis on the normal 
is understandable, if not altogether desirable. 

The aim of a minimum use of technical terms and undefined words has been 
accomplished admirably, a real achievement on the part of both writers and editor. 
In view of the diversity of material handled (from thumbsucking in babies to 
electroencephalography, from theories of the ancient Greeks to unfit personalities 
in World War II), it is only natural that type of organization and depth of treat- 
ment lack uniformity from chapter to chapter. The chapters are so written, how- 
ever, that they could be used effectively by advanced students. Moreover, the ex- 
cellent bibliographies make the volumes extremely valuable for reference. One 
noticeable characteristic, the repetition from chapter to chapter (evident, e.g., in 
the numerous presentations of psychoanalytic theory), may be a handicap for 
general reading but is advantageous for use of the chapters individually as re- 
ferences. 

In this review, special mention will be made only of those sections that seem 
of particular import to those dealing primarily with quantitative aspects of psy- 
chology. However, some idea of coverage and emphasis can be gained from the 
following listing of the chapter titles, section by section: 

Part I. Theoretical Approaches to Personality (The Structure of Personality; 
Personality in Terms of Associative Learning; Dynamic Theory of Personality). 

Part II. Cross Sectional Methods of Assessing Personality (Subjective Eva- 
luations of Personality; Personality Tests; Interpretation of Imaginative Pro- 
ductions). 

Part III. Behavior Dynamics, Experimental Behavior Disorders, and Hyp- 
notism (Clinical Approach to the Dynamics of Behavior; Physiological Effects of 
Emotional Tension; Experimental Analysis of Psychoanalytic Phenomena; Level 
of Aspiration; An Outline of Frustration Theory; Conditioned Reflex Method and 
Experimental Neurosis; Experimental Behavior Disorders in the Rat; Experi- 
mental Studies of Conflict; Hypnotism). 

Part IV. Determinants of Personality — Biological and Organic (Heredity; 
Constitutional Factors in Personality; Personality as Affected by Lesions of the 
Brain; Physiological Factors in Behavior). 

Part V. Determinants of Personality — Experiential and Sociological (In- 
fantile Experience in Relation to Personality Development; Childhood Experience 
in Relation to Personality Development; Adolescent Experience in Relation to 
Personality and Behavior; Cultural Determinants of Personality; Ecological Fac- 
tors in Human Behavior). 

Part VI. Some Outstanding Patterns of Behavior Disorder (Behavior Dis- 
orders in Childhood; Delinquent and Criminal Personalities; Unfit Personalities 
in the Military Services; The Psychoneuroses; The Functional Psychoses; The 
Concept of Psychopathic Personality; Seizure States). 

Part VII. Some Investigated Correlates of Behavior Disorder (Psychological 
Deficit; Electroencephalography). 

Part VIII. Therapy and the Prevention of Behavior Disorders (Psychiatric 
Therapy; The Prevention of Personality Disorders). 

One of the most impressive chapters from the standpoint of quantitative 
approach is that by Miller summarizing experimental studies of conflict. It in- 
cludes (1) assumptions regarding conflict based on theoretical analysis and ex- 
perimental evidence relating to the assumptions, (2) deductions from them and 
the experimental evidence regarding these deductions, (3) suggestions for 
additional experimentation, and (4) scattered statements of the relation of 
these findings to clinical theory. Logical analysis, well-controlled experi- 
mentation, and clear exposition make this chapter an outstanding represen- 
tation of the type of approach that is working toward general principles of per- 
sonality. Excellent also is the section on level of aspiration, which reports both 
experimental findings and a summary of theory in the field. The latter should be 
helpful in stimulating research to quantify further the general laws of level of 
aspiration that are presented. 

Experimental studies of psychoanalytic phenomena are discussed by Sears, 
who has drawn heavily on his Social Science Research Council Bulletin dealing 
with the same topic. Many readers, after considering the studies he quotes and 
other material in the volumes, will probably agree with him that, instead of trying 
to test psychoanalytic theory directly, “experimentalists would probably be wise 
to get all the hunches, intuitions, and experience possible from psychoanalysis 
and then, for themselves, start the laborious task of constructing a systematic 
psychology of personality, but a system based on behavioral rather than ex- 
periential data.” (p. 329) 

A clear summary is given by Sheldon of his system of morphological and 
temperament measurement. Much the same evaluation of it may be made that ap- 
peared after the publication of his two books. The idea of dealing with dimensions 
of variation rather than with types is undoubtedly a distinct improvement over 
previous morphological systems. However, the restricted group of subjects used 
in setting up the temperament scale, the method of “factor analysis” by in- 
spection alone, and the failure to present evidence that expectation did not in- 
fluence temperament ratings still leave the value of his conclusions open to ques- 
tion. 

The possible contribution of factor analysis in the study of personality is men- 
tioned in a number of chapters (e.g., the discussion of the relation between iden- 
tified factors and the functions lost by brain lesion). Systematic consideration 
of the use of factor analysis is included in MacKinnon’s chapter on the structure 
of personality. Presenting the points of view of factor analysts and of critics of 
the technique, the author treats the problems of the number and kind of factors, 
the nature of factors, and the question of whether factor analysis can reveal the 
underlying structure of personality. In conclusion he states, “It may be that ‘fac- 
tor analysis provides a powerful analytic tool for isolating the important variables 
of human personality’ (Wolfle, 1942, p. 397), but only if it is used with the best 
of psychological insight and combined with the keenest of clinical observation. Too 
many factor studies have overlooked the fact that the significance and the sta- 
bility of the factors which will be discovered will depend upon the psychological 
meaningfulness of the traits which are measured.” (p. 40). 

Guthrie suggests measuring personality in terms of an individual’s skills and 
adjustments (described in terms of the situations to which he has been exposed). 
It is his belief that more can be told about future behavior from information 
about occupation, financial status, police record, etc., than from information con- 
cerning such characteristics as loyalty, honesty, and introversion. This point of 
view is interesting in view of successful attempts to predict vocational stability 
from application-blank responses. However, workers in this field might not agree 
completely with Guthrie’s statement, “But it is just this predictive value that is 
required of a personality trait and nothing more.” (p. 66) 

Other sections of special interest include Jones’ evaluation of various subjec- 
tive methods of measuring personality; Maller’s rather pessimistic discussion of 
character, temperament, and attitude tests (“In recent years there has been 
relatively little progress toward the improvement of personality tests and a 
noticeable scarcity of original approaches.” p. 203); and the more optimistic 
treatment of imaginative productions. (The Rorschach, for example, is regarded 
as a “sharp diagnostic tool.”) 

In an exhaustive survey of studies of psychological deficit, Hunt and Cofer 
stress the need for a rigorous quantitative approach. They believe that, though 
test results demonstrate the presence of a deficit in various pathological con- 
ditions, incomparable units and absence of absolute zero make it impossible to 
give results in other than very general, and not too informative, fashion. This 
chapter is only one of many that state the necessity of adequate quantitative 
techniques. It is true that criticism is made in many instances of present instru- 
ments and methods. It is also true that some sections indicate emphatically that 
there has been too much interest in measurement and too little in the individual 
case. But throughout the book there is repeated reference to the need for help 
from the psychometrician. All in all the person interested in the quantitative ap- 
proach will probably find in these volumes some points that will irritate him, 
some that will give him a sense of achievement, and many that will present a 
challenge. 

University of Southern California CONSTANCE LOVELL 

Postage - - - - - - - - - - - - - - - - - - - - - $  57.09 
Psychometric Corporation - - - - - - - - - - - - -  1042.20 
Miscellaneous Expenses - - - - - - - - - - - - - -     5.45 
Total Expenditures - - - - - - - - - - - - - - - - $1104.74 
Excess of Receipts over Expenditures - - - - - - - $  58.26 
Balance on hand, July 1, 1945 - - - - - - - - - -    499.08 
Balance on hand, June 30, 1946 - - - - - - - - - - $ 557.34 